Algorithm: finding the number contained within most intervals
-
- Create array of counters
[INT_MIN .. INT_MAX]
- Loop from
INT_MIN
toINT_MAX
- For each loop counter value go through list and store in array how many it falls within.
- Find array element with maximum value.
;-)
I'm thinking about this.....
- Create array of counters
-
@VRonin
How generic do you want to be:- How many range items might be in your outer vector? 10? 10 million?
- You say "min and max are not usually very far apart" but that is unclear. It is vital to understand whether that means in any given pair the range will be < 50 or the range from the lowest min anywhere to the highest max anywhere, including in different pairs, will be < 50?
- Relevant to that, what is the range between the lowest min anywhere to the highest max anywhere? If that is not 50 can it be 50 million?
Both you and @artwaw seem to like
create another std:vector<int>, compute all intervals and put there, then sort this instead...
but I do not know exactly what you mean by this, that is correct for the question?
-
@JonB said in Algorithm: finding the number contained within most intervals:
How many range items might be in your outer vector? 10? 10 million?
I'd say typical use is <=10
You say "min and max are not usually very far apart" but that is unclear. It is vital to understand whether that means in any given pair the range will be < 50 or the range from the lowest min anywhere to the highest max anywhere, including in different pairs, will be < 50?
Relevant to that, what is the range between the lowest min anywhere to the highest max anywhere? If that is not 50 can it be 50 million?all lower bounds of the ranges are >=0
all upper bounds of the ranges are <= a known upper limit. that limit is realistically <=50This is my current solution. It works but it look braindead to me
#include <iostream> #include <vector> #include <algorithm> #include <memory> int numberInMostIntervals(const std::vector<std::pair<int,int>>& intervals, int upperLimit){ if(intervals.empty()) return -1; if(upperLimit==0) return 0; std::unique_ptr<int[]> counters(new int[upperLimit+1]); std::fill(counters.get(), counters.get()+upperLimit+1, 0); for(const std::pair<int,int>& i : intervals){ for(int j=i.first;j<=i.second;++j) counters[j]++; } return std::distance(counters.get(),std::max_element(counters.get(),counters.get()+upperLimit+1)); } int numberInMostIntervals(const std::vector<std::pair<int,int>>& intervals){ if(intervals.empty()) return -1; return numberInMostIntervals(intervals,std::max_element(intervals.cbegin(),intervals.cend(), [](const std::pair<int,int>& a,const std::pair<int,int>& b)->bool{return a.second<b.second;})->second ); } int main() { std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4),std::make_pair(0,2),std::make_pair(2,7)},8) << std::endl; std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4),std::make_pair(0,2),std::make_pair(5,7)},8) << std::endl; std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4)},8) << std::endl; std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{},8) << std::endl; return 0; }
-
I asked chatGPT and it came back with a more verbose version of the same thing. That means I'm as dumb as chatGPT. There MUST be a smarter way
int numberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) { std::vector<int> frequency(upperLimit + 1, 0); // Increment frequency for each interval for (const auto& interval : intervals) { int start = std::max(0, interval.first); // Make sure start is not below 0 int end = std::min(upperLimit, interval.second); // Make sure end is not above upperLimit for (int i = start; i <= end; ++i) { ++frequency[i]; } } // Find the number with the highest frequency int maxCount = 0; int result = 0; for (int i = 0; i <= upperLimit; ++i) { if (frequency[i] > maxCount) { maxCount = frequency[i]; result = i; } } return result; }
-
@VRonin So I asked chatGPT to make that answer faster end smarter....
It eliminated a for loopint numberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) { std::vector<int> frequency(upperLimit + 2, 0); // Use +2 to handle boundary at upperLimit properly // Mark the start and end points for each interval for (const auto& interval : intervals) { int start = std::max(0, interval.first); // Make sure start is not below 0 int end = std::min(upperLimit, interval.second); // Make sure end is not above upperLimit ++frequency[start]; if (end + 1 <= upperLimit) { --frequency[end + 1]; } } // Apply the prefix sum technique to get the frequency of each number int maxCount = 0; int result = 0; int currentCount = 0; for (int i = 0; i <= upperLimit; ++i) { currentCount += frequency[i]; if (currentCount > maxCount) { maxCount = currentCount; result = i; } } return result; }
🤷♂️
-
Well I don't follow how or what it's doing, I have only glanced and not tried.
If I understand right, both sets of code declare an array something like
std::vector<int> frequency(upperLimit)
. And then count through every number for its "frequency". IfupperLimit
is LARGE that means (a) a big array (lots of memory) and (b) (at least)O(upperLimit)
iteration (slow). I would like to avoid both of these.I have an algorithm in mind, just by thinking about it. I have not written or tested it. It may be flawed, in that it would not necessarily give the correct answer, in which case it's obviously unacceptable.
Here is my thought.
One assumption I make is that for the final "result" value I will only consider/return a value equal to (any one of) all the minimum values across all the minima in the pairs. I claim no other number can lie in the ranges more times than that.
Then I am thinking of approaching like this:
-
Create one vector to hold all the minimum values across all the pairs. (It doesn't matter whether you extract just the
int
or do it with the pairs.) Sort than ascending. -
Create another vector to hold all the maximum values values across all the pairs. Sort that ascending too. Obviously I am aware this vector will not necessarily be in the same order as the minima vector, by which I mean
maxima[0]
is not necessarily the "paired" maximum forminima[0]
. E.g. ifpair[0]
is(0, 50)
andpair[1]
is(10, 40)
thenminima[0] == 0
andmaxima[0]
is40
, they do not come from the same pair. I claim my algorithm is OK with this. -
You note that/keep the incrementing counters in their own
frequency[]
container. You only need entries for eachminima[]
element. -
Now use separate
i
(for minima) &j
(for maxima) index variables to march through each vector independent of each other. -
If you start at
i = 0
thenminima[i]
occurs once so far, you store 1 in its frequency array. When you move toi = 1
increment the counter forminima[0]
(i.e.frequency[0]
) --- it occurs in 2 ranges so far --- and create a new frequent entry initialised at 1 for this second entry. And so on through the minima array. -
At the same time as you are marching
i
throughminima[i]
values you also marchj
through the maxima array. Your decide whether to look atminima[i]
next versusmaxima[j]
by whichever value is lower. -
When you reach a
maxima[j]
you count one less from then on for allminima[k]
whereminima[k] < maxima[j]
. You know thatmaxima[0] >= minima[0]
. So once you meet somemaxima[j]
which is less thanminima[1]
you no longer want to add 1 the "frequency" forminima[0]
from then on, because it's excluded by themaxima[j]
.
If this all works, then
- You only need a
frequency[]
array with as many elements as the minima/pairs array length. That is like 10 or 50, where if the largest maximum is 1 billion you have saved a lot of memory! - You only to examine as many elements as are in the minima + maxima arrays together, i.e. twice the number of elements in the pair array. That is like 20 or 100, where if the number of pairs elements is 2 billion you have saved a lot of iterations!
I'm thinking my space usage is something like
O(pairs-elements * 3)
and my iteration time usageO(pairs-elements * 2)
. Give or take. Which is a awful lot less thanO(upper-limit)
for both/either where upper-limit or number-of-pairs =~ 1,000,000.Would you like to either prove my approach is incorrect or have a go at implementing it, and test with a million elements and upper limit on maxima? :D
And if perchance I am right you know not to bother ChatGPT from now on... ;-)
P.S.
I'm not sure, but maybe the ChatGPT algorithm's firstinterval
loop is indeed of acceptable timeO(number-of-pairs)
. But (a) itsfrequency(upperLimit)
array is "large" space and (b) its finalfor (i <= upperLimit
is "large" iterations. Go ask ChatGPT whatO()
the execution times and space requirements of its algorithm are whenupperLimit
and number-of-pairs are each a billion? :) -
-
@VRonin , @J-Hilk
So nobody wanted to examine/implement my suggested algorithm, shame.... :( ;-)So.... I sat down this morning and provided an implementation of my approach. Here is complete, standalone code (C++ 17+): it has a
main()
which generates a bunch of the min-max range pairs randomly, calls the ChatGPT best implementation (i.e. @VRonin's second one) and my "jon" implementation (extensively commented), and reports the answer and the time for each.#include <algorithm> #include <chrono> #include <iostream> #include <random> #include <vector> std::pair<int, int> chatGPTNumberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) { if (intervals.empty()) return {-1, -1}; std::vector<int> frequency(upperLimit + 2, 0); // Use +2 to handle boundary at upperLimit properly // Mark the start and end points for each interval for (const auto& interval : intervals) { ++frequency[interval.first]; --frequency[interval.second + 1]; } // Apply the prefix sum technique to get the frequency of each number int maxCount = 0; int result = 0; int currentCount = 0; for (int i = 0; i <= upperLimit; ++i) { currentCount += frequency[i]; if (currentCount > maxCount) { maxCount = currentCount; result = i; } } return {result, maxCount}; } std::pair<int, int> jonNumberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) { if (intervals.empty()) return {-1, -1}; // vector of size intervals.size() // the pair<int, int> elements will hold <min-bound, count-of-min-bound> as each eleemnt std::vector<std::pair<int, int>> frequency(intervals.size()); // fill frequency[] with the min-bounds in each first-member, count in second-member doesn't matter here for (int i = 0; i < intervals.size(); i++) frequency[i] = {intervals[i].first, 0 }; // sort frequency[] by the min-bounds value in each element, ascending std::sort(frequency.begin(), frequency.end(), [](const std::pair<int, int> &x, const std::pair<int, int> &y) { return x.first < y.first; }); // fill the sorted frequency[] count-of-min-bound in each element // frequency[0].count = 1, frequency[1].count = 2, frequency[i].count = i + 1 for (int i = 0; i < frequency.size(); i++) frequency[i].second = i + 1; // go through the interval[] pair elements (random order for max range values) // go *down* through the frequency[] pair elements (sorted by min range values ascending) // while the freqency min values are greater than the interval max value decrement that element's frequency count for (int i = 0; i < intervals.size(); i++) for (int j = frequency.size() - 1; j >= 0 && frequency[j].first > intervals[i].second ; j--) frequency[j].second--; // find the frequency[] pair element with the greatest count auto most_frequent = std::max_element(frequency.begin(), frequency.end(), [](const std::pair<int, int> &x, const std::pair<int, int> &y) { return x.second < y.second; }); return *most_frequent; } int main() { constexpr int upperLimit = 1000000000; constexpr int num_intervals = 10; std::vector<std::pair<int, int>> intervals; std::random_device rd; std::mt19937 gen(rd()); std::uniform_int_distribution<int> distrib(0, 100); for (int i = 0; i < num_intervals; i++) { std::pair<int, int> pair(distrib(gen), distrib(gen)); if (pair.first > pair.second) std::swap(pair.first, pair.second); intervals.push_back(pair); std::cout << "(" << pair.first << ", " << pair.second << ")" << std::endl; } uint64_t start_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count(); std::pair<int, int> result = chatGPTNumberInMostIntervals(intervals, upperLimit); uint64_t end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count(); std::cout << "ChatGPT: " << "Number: " << result.first << ", Frequency: " << result.second << ", Time: " << end_time - start_time << std::endl; start_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count(); result = jonNumberInMostIntervals(intervals, upperLimit); end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count(); std::cout << "jon: " << "Number: " << result.first << ", Frequency: " << result.second << ", Time: " << end_time - start_time << std::endl; return 0; }
I am deliberately testing with 10 interval ranges but 1 billion(!) as
upperLimit
, i.e. large.In that case ChatGPT uses (a) a large amount of memory (its
frequency[]
vector is size 1 billion elements) and (b) a large amount of iterations/time (it looks through the 1 billion elements in the vector). OTOH mine uses (a) small memory (frequency[]
vector is same size asintervals[]
vector) and (b) small amount of iterations/time (it looks through just theintervals[]
/frequency[]
vector elements).Here is the output compiled and run for debug:
ChatGPT: Number: 17, Frequency: 6, Time: 6391 jon: Number: 17, Frequency: 6, Time: 0
and here compiled and run for release:
ChatGPT: Number: 45, Frequency: 8, Time: 2355 jon: Number: 45, Frequency: 8, Time: 0
So, I don't mean to blow my trumpet, but mine is somewhere between thousands and "infinitely" faster than the ChatGPT one, and uses like a hundred-millionth of the memory, at least with a "large"
upperLimit
... ;-) I realise it won't make so much difference with a "smaller" upper limit, and that is your RL case, but still....@VRonin
Having taken the time to produce this to address your "what I'm doing now but feels sooooo inefficient", I hope you might take a look it/test t for yourself.My conclusion: Mine is more work figuring it than copying a (decent) solution from ChatGPT, but in view of the vast efficiency and space improvements I feel I do not yet need to hang up my programming clogs, I feel I am not replaced by ChatGPT, yet... :D
P.S.
In the interests of clarity/fairness I must admit where my algorithm is not so good. As you increase thenum_intervals
mine will use more memory for thefrequency[]
array and take more time through its loops/sorting, where ChatGPT's will be relatively unaffected. Basically, for both speed and space, ChatGPT's is affected byupperLimit
size while mine is affected by the size ofnum_intervals
. So the "best" depends on whether you have a large range-bound in your ranges or a large number of ranges. -
Here is another idea: Treat the intervals as sets and compute their intersections.
Start with the first interval and compute the intersection with every other interval. Store the intersections inside a second vector. The second vector should now have n-1 elements (intersection of 1st with 2nd, 1st with 3rd, 1st with 4th, ..., 1st with nth, but not 1st with 1st). Now you take the intervals in this second vector and intersect them with all original intervals again (first interval in the second vector is for 1st with 2nd, so we only need to start intersecting starting from 3rd, for the second interval we start from the 4th interval, and so on). The intersections are then stored inside a third vector. If there are only empty intervals inside the third vector, we can stop for now. Otherwise we throw away the second vector and repeat this process with the third vector. (The second vector is intersections containing 2 intervals, the third vector contains intersections of 3 intervals, so we don't need the second vector anymore if there are non-empty intervals in the third vector.) From the last vector we get from this process we keep one intersected interval (just take the first non-empty intersection) and remember it. It is the intersection with the most intervals including the first interval. Now, we repeat this for the second interval, and so on.
This should give as for each interval the maximum of intersections containing this interval and the interval limits of the intersection. You can just pick the intersected interval with the highest number of intersections and use the minimum of that interval as the number.
It is a little tricky to wrap your head around this (initially I thought it would be little easier). The way I described it runtime would be at least O(n³). I'm not sure if this is good or bad.