The most efficient way to filter a huge amount of data. Mentor needed!

    Rickz wrote (#1):

    Hello community,
    First of all, I wish you all the best. I'm a beginner, so please bear with any illogical questions. :)
    I want to create a program that will be used to filter certain data.
    The data is currently in a text file with 15M lines, about 300 MB in size.
    I want to define certain criteria to filter that data by, regardless of the order in which the filters are turned on.
    The goal is to get one list with all the filters applied.

    Txt file looks like this:

    11 13 24 25 56 78 
    1 12 13 41 45 69 87 
    //more lines
    29 33 37 41 45 55 69
    15 24 39 56 78  81 99 
    eof
    

    Each line contains seven numbers separated by spaces; each line is a unique sequence of numbers ordered by
    This is my code so far:

    #include <QByteArray>
    #include <QByteArrayList>
    #include <QCoreApplication>
    #include <QDebug>
    #include <QElapsedTimer>
    #include <QFile>
    #include <QVector>

    // Converts one line of text into the integers it contains.
    QVector<int> getVectors (QByteArray s)
    {
        QVector<int> gIntList;
        const QByteArrayList parts = s.split(' ');
        for (const QByteArray &part : parts)
        {
            // Empty tokens (from trailing or doubled spaces) fail toInt() and are skipped.
            bool isNum = false;
            int n = part.toInt(&isNum);
            if(isNum)
                gIntList << n;
        }
        return gIntList;
    }

    // Returns true if vec contains no1.
    bool contNo(QVector<int> vec, int no1)
    {
        for (int i =0; i< vec.size(); ++i)
            if(vec.at(i)==no1)
            {
                return true;
            }
        return false; // not found
    }

    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        QElapsedTimer myTimer;
        myTimer.start();

        QFile file("300mb-15mlines.txt");
        if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        {
            qInfo()<<"Can't open file";
            return 1; // nothing to filter without the file
        }

        QByteArray filedata=file.readAll();
        QByteArrayList balist = filedata.split('\n');
        QVector<QVector<int>> allLines;
        for (int j = 0; j<balist.size(); j++)
        {
            QVector<int> comb;
            comb = getVectors(balist.at(j));
            // Keep only the lines that contain both 24 and 78.
            if(contNo(comb,24) && contNo(comb,78) ){
                allLines<<comb;
            }
        }

        qInfo()<< allLines;
        file.close();
        qint64 nMilliseconds = myTimer.elapsed();
        qInfo()<< "Time: "<< nMilliseconds / 1000; // whole seconds

        qInfo()<<"END";

        return a.exec();
    }
    
    

    This code works, but it is very slow and takes 1 GB of RAM.
    How can I improve the program, given that I plan to add a lot of filters? The ultimate goal is to get a single list with all the filters applied.

    mrjj, Lifetime Qt Champion, wrote (#2, last edited by mrjj):

      Hi and welcome to the forums.
      You read the whole file at once:
      filedata=file.readAll();
      I was wondering whether the lines have line endings, so you can read the file one line at a time?
      That would reduce the memory use a lot.

      Also look into
      https://doc.qt.io/qt-5/qstringref.html
      for the splitting: it points into the original source, so nothing is copied, which makes it
      lighter on memory and much faster.

      Also

      bool contNo(QVector<int> vec, int no1)
      {
          for (int i =0; i< vec.size(); ++i)
              if(vec.at(i)==no1)
              {
                  return true;
              }
          return false; // not found
      }
      

      Here you pass the list as a copy! If it's purely to search it and not to change it, then use
      bool contNo(const QVector<int> &vec, int no1)

      so it's a reference and not a copy per call.

      Same with QVector<int> getVectors (QByteArray s):
      here again you pass a copy, and it could be faster with const &.
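
To make the two suggestions above concrete, here is a minimal sketch in plain C++17 rather than Qt, so that it stands alone. The names `parseLine` and `contains` are made up for illustration. `std::string_view` plays the role of a non-owning view into the original buffer (the same idea as the QStringRef link), and the search helper takes its argument by const reference. Qt containers are implicitly shared, so passing them by value is cheaper than a deep copy, but a const reference still avoids the per-call bookkeeping.

```cpp
#include <cassert>
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: parses space-separated integers from `line` without
// creating any intermediate substrings. Non-numeric text is skipped.
std::vector<int> parseLine(std::string_view line)
{
    std::vector<int> numbers;
    const char *p = line.data();
    const char *end = p + line.size();
    while (p < end) {
        if (*p == ' ' || *p == '\r' || *p == '\n') {
            ++p; // skip separators, including a trailing space
            continue;
        }
        int value = 0;
        auto [next, ec] = std::from_chars(p, end, value);
        if (ec == std::errc()) {
            numbers.push_back(value);
            p = next;
        } else {
            // Not a number: skip ahead to the next separator.
            while (p < end && *p != ' ')
                ++p;
        }
    }
    return numbers;
}

// Takes the vector by const reference, so nothing is copied per call.
bool contains(const std::vector<int> &numbers, int wanted)
{
    for (int n : numbers)
        if (n == wanted)
            return true;
    return false;
}
```

In a Qt program the same shape should carry over: parse directly from the bytes of each line and pass the resulting container around by const reference.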

        JonB wrote (#3, last edited by JonB):

        @Rickz
        All good points from @mrjj above, which you should read/act on.

        This code works but very slow and takes 1Gb RAM.

        For the RAM consumption you have a potential choice to make:

        • The "easy" way, which is what you have now, is to read all the lines into memory, and apply your "filters" one after the other. That's fine, but will require memory proportional to the whole file size.

        • Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines? Or, can you read line #1, apply filter #1, then apply filter #2 if line #1 passed filter #1, and so on? (This depends on how your filtering works: does it work on each line independently of any other lines?) If this is the case, and you are prepared to implement it this way, your total RAM will be proportional to just one line size instead of all the data!

        In terms of speed, it depends on how your filters work, but it may be that speed is approximately equal in both cases.

          mrjj, Lifetime Qt Champion, wrote (#4):

          @JonB
          Good points about how the filtering works.

          I also noted that
          if(contNo(comb,24) && contNo(comb,78) )

          loops through the list from the start once per number, so
          if more contNo checks are added, it might be more efficient to
          check several numbers in one loop, like

          ( contNo(comb,24,78,116,155) )
          so we don't loop from the beginning for each number.
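
One possible way to check several numbers in a single pass, sketched in plain C++ so it is self-contained. The helper name `containsAll` and the 32-value limit are assumptions made for this sketch, not something from the thread. Each wanted number gets one bit in a mask, and the loop stops as soon as all bits are set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns true if `line` contains every value in `wanted`.
// It walks `line` once and records which wanted values were seen in a bitmask,
// which is why `wanted` is limited to 32 entries here.
bool containsAll(const std::vector<int> &line, const std::vector<int> &wanted)
{
    assert(wanted.size() <= 32);
    const std::uint32_t all =
        wanted.size() == 32 ? 0xFFFFFFFFu : ((1u << wanted.size()) - 1u);
    std::uint32_t found = 0;
    for (int value : line) {
        for (std::size_t k = 0; k < wanted.size(); ++k) {
            if (wanted[k] == value)
                found |= (1u << k);
        }
        if (found == all)
            return true; // every wanted value has been seen, stop early
    }
    return found == all; // true only when `wanted` is empty
}
```

With only seven numbers per line the gain over two separate searches is probably small, so it is worth measuring before committing to this.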

            Rickz wrote (#5):

            @mrjj said in The most efficient way to filter a huge amount of data. Mentor needed!:

            You read all of the file
            filedata=file.readAll();
            I was wondering if they have line endings so you can read it a line at a time ?

            Each line ends with a space. I'm not sure how to detect the end of a line directly, so I create a QByteArray from the whole file and then split it into a QByteArrayList:

            QByteArrayList balist = filedata.split('\n');
            

            BTW, I use QVector just to convert each line into an int array so I can apply math filters, for example the sum of the numbers in a line.

            Here you pass the list as a copy! If it's purely to search it and not to change it, then use
            bool contNo(const QVector<int> &vec, int no1)
            so it's a reference and not a copy per call.

            Ok thank you, I understand this.

            @JonB said in The most efficient way to filter a huge amount of data. Mentor needed!:

            Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines?

            Well, the file (the whole list) is the base list. After applying filter #1 we have a new, filtered list; then filter #2 is applied to that new list, already filtered by filter #1, and so on.

            For example:

            Basic list:
            10 20 30 40 50 60 70 
            20 30 40 50 60 70 80 
            30 40 50 60 70 80 90 
            40 50 60 70 80 90 100 
            

            Filter #1 has to select only lines that contain the numbers 20 and 30.

            new list
            10 20 30 40 50 60 70 
            20 30 40 50 60 70 80 
            

            Now filter #2 has to select only lines that contain a number greater than 70.
            That will be:

            newer list
            20 30 40 50 60 70 80 
            

            and so on and so on...

              mrjj, Lifetime Qt Champion, wrote (#6, last edited by mrjj):

              Hi
              If you can do
              QByteArrayList balist = filedata.split('\n');
              and each entry in balist is a line, then the file has line endings and
              you can read it line by line if you want to reduce memory use.

              Also, if all lines have the same number of values, then calling reserve on QVector
              https://doc.qt.io/qt-5/qvector.html#reserve
              may also help, so it won't have to grow for each value.
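
A tiny sketch of the reserve idea, using `std::vector` in place of QVector so it compiles on its own; `makeLine` and `kValuesPerLine` are names made up for this example. Reserving once means the appends that follow do not trigger a reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// From the thread: every line holds seven numbers.
constexpr std::size_t kValuesPerLine = 7;

// Hypothetical helper: builds the vector for one line with its storage
// reserved up front, so the push_back calls below never reallocate.
std::vector<int> makeLine(const int (&values)[kValuesPerLine])
{
    std::vector<int> line;
    line.reserve(kValuesPerLine);
    for (int v : values)
        line.push_back(v);
    return line;
}
```

QVector::reserve, linked above, is used the same way: call it once before appending the values of a line.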

              You really have 15 million lines ? :)
              That is some text file :)

                JonB wrote (#7, last edited by JonB):

                @Rickz
                Since your example shows that your filters are simply sequential and each one looks at a single line only, you can, if you wish, read one line at a time and pass it through all the filters. Reading all the lines into memory/an array at the start does not gain you anything. If you do that, RAM use drops from roughly (300 MB * whatever overhead each line adds) to about 100 bytes for the current line, plus whatever you keep for the lines that pass. Up to you.
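
Here is a rough sketch of that line-at-a-time approach in plain C++, reading from any `std::istream` so it can be tried without the real file. `filterStream`, `Line` and `Filter` are names invented for this sketch. In Qt, reading with `QFile::readLine` in a loop would take the place of `std::getline`.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Line = std::vector<int>;
using Filter = std::function<bool(const Line &)>;

// Reads `in` one line at a time and returns only the lines that pass every
// filter. Only the current line and the kept lines are held in memory.
std::vector<Line> filterStream(std::istream &in, const std::vector<Filter> &filters)
{
    std::vector<Line> kept;
    std::string text;
    Line numbers;
    while (std::getline(in, text)) {
        numbers.clear();
        std::istringstream tokens(text);
        for (int n; tokens >> n;)
            numbers.push_back(n);
        if (numbers.empty())
            continue; // blank or non-numeric line
        // all_of stops at the first filter that rejects the line.
        const bool passes = std::all_of(
            filters.begin(), filters.end(),
            [&numbers](const Filter &f) { return f(numbers); });
        if (passes)
            kept.push_back(numbers);
    }
    return kept;
}
```

Passing a `std::ifstream` opened on the data file works the same way as the string stream. If the kept lines are written straight to an output file instead of being collected, memory use stays at about one line.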
