Algorithm: finding the number contained within most intervals

artwaw

@VRonin Silly idea - own container class that acts as std::vector you showed but internally computes the interval on insert and internally sorts by it on demand or at every insert, depending what you need?

JonB

@VRonin

Create array of counters [INT_MIN .. INT_MAX]
Loop from INT_MIN to INT_MAX
For each loop counter value go through list and store in array how many it falls within.
Find array element with maximum value.

;-)

I'm thinking about this.....

JonB

@VRonin
How generic do you want to be:

How many range items might be in your outer vector? 10? 10 million?
You say "min and max are not usually very far apart" but that is unclear. It is vital to understand whether that means in any given pair the range will be < 50 or the range from the lowest min anywhere to the highest max anywhere, including in different pairs, will be < 50?
Relevant to that, what is the range between the lowest min anywhere to the highest max anywhere? If that is not 50 can it be 50 million?

Both you and @artwaw seem to like

create another std:vector<int>, compute all intervals and put there, then sort this instead...

but I do not know exactly what you mean by this, that is correct for the question?

VRonin

@JonB said in Algorithm: finding the number contained within most intervals:

How many range items might be in your outer vector? 10? 10 million?

I'd say typical use is <=10

You say "min and max are not usually very far apart" but that is unclear. It is vital to understand whether that means in any given pair the range will be < 50 or the range from the lowest min anywhere to the highest max anywhere, including in different pairs, will be < 50?
Relevant to that, what is the range between the lowest min anywhere to the highest max anywhere? If that is not 50 can it be 50 million?

all lower bounds of the ranges are >=0
all upper bounds of the ranges are <= a known upper limit. that limit is realistically <=50

This is my current solution. It works but it look braindead to me

#include <iostream>
#include <vector>
#include <algorithm>
#include <memory>

int numberInMostIntervals(const std::vector<std::pair<int,int>>& intervals, int upperLimit){
    if(intervals.empty())
        return -1;
    if(upperLimit==0)
        return 0;
    std::unique_ptr<int[]> counters(new int[upperLimit+1]);
    std::fill(counters.get(), counters.get()+upperLimit+1, 0);
    for(const std::pair<int,int>& i : intervals){
        for(int j=i.first;j<=i.second;++j)
            counters[j]++;
    }
    return std::distance(counters.get(),std::max_element(counters.get(),counters.get()+upperLimit+1));
}
int numberInMostIntervals(const std::vector<std::pair<int,int>>& intervals){
    if(intervals.empty())
        return -1;
    return numberInMostIntervals(intervals,std::max_element(intervals.cbegin(),intervals.cend(),
        [](const std::pair<int,int>& a,const std::pair<int,int>& b)->bool{return a.second<b.second;})->second
    );
}

int main()
{
    std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4),std::make_pair(0,2),std::make_pair(2,7)},8) << std::endl;
    std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4),std::make_pair(0,2),std::make_pair(5,7)},8) << std::endl;
    std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{std::make_pair(1,4)},8) << std::endl;
    std::cout<<numberInMostIntervals(std::vector<std::pair<int,int>>{},8) << std::endl;
    return 0;
}

VRonin

I asked chatGPT and it came back with a more verbose version of the same thing. That means I'm as dumb as chatGPT. There MUST be a smarter way

int numberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) {
    std::vector<int> frequency(upperLimit + 1, 0);

    // Increment frequency for each interval
    for (const auto& interval : intervals) {
        int start = std::max(0, interval.first); // Make sure start is not below 0
        int end = std::min(upperLimit, interval.second); // Make sure end is not above upperLimit

        for (int i = start; i <= end; ++i) {
            ++frequency[i];
        }
    }

    // Find the number with the highest frequency
    int maxCount = 0;
    int result = 0;

    for (int i = 0; i <= upperLimit; ++i) {
        if (frequency[i] > maxCount) {
            maxCount = frequency[i];
            result = i;
        }
    }

    return result;
}

J.Hilk

@VRonin So I asked chatGPT to make that answer faster end smarter....
It eliminated a for loop

int numberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit) {
    std::vector<int> frequency(upperLimit + 2, 0); // Use +2 to handle boundary at upperLimit properly

    // Mark the start and end points for each interval
    for (const auto& interval : intervals) {
        int start = std::max(0, interval.first); // Make sure start is not below 0
        int end = std::min(upperLimit, interval.second); // Make sure end is not above upperLimit

        ++frequency[start];
        if (end + 1 <= upperLimit) {
            --frequency[end + 1];
        }
    }

    // Apply the prefix sum technique to get the frequency of each number
    int maxCount = 0;
    int result = 0;
    int currentCount = 0;

    for (int i = 0; i <= upperLimit; ++i) {
        currentCount += frequency[i];
        if (currentCount > maxCount) {
            maxCount = currentCount;
            result = i;
        }
    }

    return result;
}

🤷‍♂️

VRonin

@J-Hilk said in Algorithm: finding the number contained within most intervals:

It eliminated a for loop

with the added benefit of now not working anymore. I LOVE IT!

JonB

Well I don't follow how or what it's doing, I have only glanced and not tried.

If I understand right, both sets of code declare an array something like std::vector<int> frequency(upperLimit). And then count through every number for its "frequency". If upperLimit is LARGE that means (a) a big array (lots of memory) and (b) (at least) O(upperLimit) iteration (slow). I would like to avoid both of these.

I have an algorithm in mind, just by thinking about it. I have not written or tested it. It may be flawed, in that it would not necessarily give the correct answer, in which case it's obviously unacceptable.

Here is my thought.

One assumption I make is that for the final "result" value I will only consider/return a value equal to (any one of) all the minimum values across all the minima in the pairs. I claim no other number can lie in the ranges more times than that.

Then I am thinking of approaching like this:

Create one vector to hold all the minimum values across all the pairs. (It doesn't matter whether you extract just the int or do it with the pairs.) Sort than ascending.
Create another vector to hold all the maximum values values across all the pairs. Sort that ascending too. Obviously I am aware this vector will not necessarily be in the same order as the minima vector, by which I mean maxima[0] is not necessarily the "paired" maximum for minima[0]. E.g. if pair[0] is (0, 50) and pair[1] is (10, 40) then minima[0] == 0 and maxima[0] is 40, they do not come from the same pair. I claim my algorithm is OK with this.
You note that/keep the incrementing counters in their own frequency[] container. You only need entries for each minima[] element.
Now use separate i (for minima) & j (for maxima) index variables to march through each vector independent of each other.
If you start at i = 0 then minima[i] occurs once so far, you store 1 in its frequency array. When you move to i = 1 increment the counter for minima[0] (i.e. frequency[0]) --- it occurs in 2 ranges so far --- and create a new frequent entry initialised at 1 for this second entry. And so on through the minima array.
At the same time as you are marching i through minima[i] values you also march j through the maxima array. Your decide whether to look at minima[i] next versus maxima[j] by whichever value is lower.
When you reach a maxima[j] you count one less from then on for all minima[k] where minima[k] < maxima[j]. You know that maxima[0] >= minima[0]. So once you meet some maxima[j] which is less than minima[1] you no longer want to add 1 the "frequency" for minima[0] from then on, because it's excluded by the maxima[j].

If this all works, then

You only need a frequency[] array with as many elements as the minima/pairs array length. That is like 10 or 50, where if the largest maximum is 1 billion you have saved a lot of memory!
You only to examine as many elements as are in the minima + maxima arrays together, i.e. twice the number of elements in the pair array. That is like 20 or 100, where if the number of pairs elements is 2 billion you have saved a lot of iterations!

I'm thinking my space usage is something like O(pairs-elements * 3) and my iteration time usage O(pairs-elements * 2). Give or take. Which is a awful lot less than O(upper-limit) for both/either where upper-limit or number-of-pairs =~ 1,000,000.

Would you like to either prove my approach is incorrect or have a go at implementing it, and test with a million elements and upper limit on maxima? :D

And if perchance I am right you know not to bother ChatGPT from now on... ;-)

P.S.
I'm not sure, but maybe the ChatGPT algorithm's first interval loop is indeed of acceptable time O(number-of-pairs). But (a) its frequency(upperLimit) array is "large" space and (b) its final for (i <= upperLimit is "large" iterations. Go ask ChatGPT what O() the execution times and space requirements of its algorithm are when upperLimit and number-of-pairs are each a billion? :)

JonB

@VRonin , @J-Hilk
So nobody wanted to examine/implement my suggested algorithm, shame.... :( ;-)

So.... I sat down this morning and provided an implementation of my approach. Here is complete, standalone code (C++ 17+): it has a main() which generates a bunch of the min-max range pairs randomly, calls the ChatGPT best implementation (i.e. @VRonin's second one) and my "jon" implementation (extensively commented), and reports the answer and the time for each.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

std::pair<int, int> chatGPTNumberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit)
{
    if (intervals.empty())
        return {-1, -1};

    std::vector<int> frequency(upperLimit + 2, 0); // Use +2 to handle boundary at upperLimit properly

    // Mark the start and end points for each interval
    for (const auto& interval : intervals) {
        ++frequency[interval.first];
        --frequency[interval.second + 1];
    }

    // Apply the prefix sum technique to get the frequency of each number
    int maxCount = 0;
    int result = 0;
    int currentCount = 0;

    for (int i = 0; i <= upperLimit; ++i) {
        currentCount += frequency[i];
        if (currentCount > maxCount) {
            maxCount = currentCount;
            result = i;
        }
    }

    return {result, maxCount};
}


std::pair<int, int> jonNumberInMostIntervals(const std::vector<std::pair<int, int>>& intervals, int upperLimit)
{
    if (intervals.empty())
        return {-1, -1};

    // vector of size intervals.size()
    // the pair<int, int> elements will hold <min-bound, count-of-min-bound> as each eleemnt
    std::vector<std::pair<int, int>> frequency(intervals.size());

    // fill frequency[] with the min-bounds in each first-member, count in second-member doesn't matter here
    for (int i = 0; i < intervals.size(); i++)
        frequency[i] = {intervals[i].first, 0 };

    // sort frequency[] by the min-bounds value in each element, ascending
    std::sort(frequency.begin(), frequency.end(),
              [](const std::pair<int, int> &x, const std::pair<int, int> &y)
              {
                  return x.first < y.first;
              });

    // fill the sorted frequency[] count-of-min-bound in each element
    // frequency[0].count = 1, frequency[1].count = 2, frequency[i].count = i + 1
    for (int i = 0; i < frequency.size(); i++)
        frequency[i].second = i + 1;
    
    // go through the interval[] pair elements (random order for max range values)
    //   go *down* through the frequency[] pair elements (sorted by min range values ascending)
    //     while the freqency min values are greater than the interval max value decrement that element's frequency count
    for (int i = 0; i < intervals.size(); i++)
        for (int j = frequency.size() - 1; j >= 0 && frequency[j].first > intervals[i].second ; j--)
            frequency[j].second--;
    
    // find the frequency[] pair element with the greatest count
    auto most_frequent = std::max_element(frequency.begin(), frequency.end(),
                                          [](const std::pair<int, int> &x, const std::pair<int, int> &y)
                                          {
                                              return x.second < y.second;
                                          });

    return *most_frequent;
}


int main()
{
    constexpr int upperLimit = 1000000000;
    constexpr int num_intervals = 10;
    std::vector<std::pair<int, int>> intervals;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<int> distrib(0, 100);
    for (int i = 0; i < num_intervals; i++) {
        std::pair<int, int> pair(distrib(gen), distrib(gen));
        if (pair.first > pair.second)
            std::swap(pair.first, pair.second);
        intervals.push_back(pair);
        std::cout << "(" << pair.first << ", " << pair.second << ")" << std::endl;
    }

    uint64_t start_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
    std::pair<int, int> result = chatGPTNumberInMostIntervals(intervals, upperLimit);
    uint64_t end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
    std::cout << "ChatGPT: " << "Number: " << result.first << ", Frequency: " << result.second << ", Time: " << end_time - start_time << std::endl;

    start_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
    result = jonNumberInMostIntervals(intervals, upperLimit);
    end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
    std::cout << "jon: " << "Number: " << result.first << ", Frequency: " << result.second << ", Time: " << end_time - start_time << std::endl;

    return 0;
}

I am deliberately testing with 10 interval ranges but 1 billion(!) as upperLimit, i.e. large.

In that case ChatGPT uses (a) a large amount of memory (its frequency[] vector is size 1 billion elements) and (b) a large amount of iterations/time (it looks through the 1 billion elements in the vector). OTOH mine uses (a) small memory (frequency[] vector is same size as intervals[] vector) and (b) small amount of iterations/time (it looks through just the intervals[]/frequency[] vector elements).

Here is the output compiled and run for debug:

ChatGPT: Number: 17, Frequency: 6, Time: 6391
jon: Number: 17, Frequency: 6, Time: 0

and here compiled and run for release:

ChatGPT: Number: 45, Frequency: 8, Time: 2355
jon: Number: 45, Frequency: 8, Time: 0

So, I don't mean to blow my trumpet, but mine is somewhere between thousands and "infinitely" faster than the ChatGPT one, and uses like a hundred-millionth of the memory, at least with a "large" upperLimit... ;-) I realise it won't make so much difference with a "smaller" upper limit, and that is your RL case, but still....

@VRonin
Having taken the time to produce this to address your "what I'm doing now but feels sooooo inefficient", I hope you might take a look it/test t for yourself.

My conclusion: Mine is more work figuring it than copying a (decent) solution from ChatGPT, but in view of the vast efficiency and space improvements I feel I do not yet need to hang up my programming clogs, I feel I am not replaced by ChatGPT, yet... :D

P.S.
In the interests of clarity/fairness I must admit where my algorithm is not so good. As you increase the num_intervals mine will use more memory for the frequency[] array and take more time through its loops/sorting, where ChatGPT's will be relatively unaffected. Basically, for both speed and space, ChatGPT's is affected by upperLimit size while mine is affected by the size of num_intervals. So the "best" depends on whether you have a large range-bound in your ranges or a large number of ranges.

SimonSchroeder

Here is another idea: Treat the intervals as sets and compute their intersections.

Start with the first interval and compute the intersection with every other interval. Store the intersections inside a second vector. The second vector should now have n-1 elements (intersection of 1st with 2nd, 1st with 3rd, 1st with 4th, ..., 1st with nth, but not 1st with 1st). Now you take the intervals in this second vector and intersect them with all original intervals again (first interval in the second vector is for 1st with 2nd, so we only need to start intersecting starting from 3rd, for the second interval we start from the 4th interval, and so on). The intersections are then stored inside a third vector. If there are only empty intervals inside the third vector, we can stop for now. Otherwise we throw away the second vector and repeat this process with the third vector. (The second vector is intersections containing 2 intervals, the third vector contains intersections of 3 intervals, so we don't need the second vector anymore if there are non-empty intervals in the third vector.) From the last vector we get from this process we keep one intersected interval (just take the first non-empty intersection) and remember it. It is the intersection with the most intervals including the first interval. Now, we repeat this for the second interval, and so on.

This should give as for each interval the maximum of intersections containing this interval and the interval limits of the intersection. You can just pick the intersected interval with the highest number of intersections and use the minimum of that interval as the number.

It is a little tricky to wrap your head around this (initially I thought it would be little easier). The way I described it runtime would be at least O(n³). I'm not sure if this is good or bad.

Algorithm: finding the number contained within most intervals

Problem

Notes for optimisation

Examples