Qt Forum

    Unsolved The most efficient way to filter a huge amount of data. Mentor needed!

    General and Desktop
      Rickz last edited by

      Hello community,
      First of all, I wish you all the best. I'm a beginner, so please bear with any illogical questions. :)
      I want to create a program that filters certain data.
      The data is currently in a text file with 15 million lines, about 300 MB in size.
      I want to define criteria for filtering that data, regardless of the order in which the filters are switched on.
      The goal is to get one list with all the filters applied.

      Txt file looks like this:

      11 13 24 25 56 78 
      1 12 13 41 45 69 87 
      //more lines
      29 33 37 41 45 55 69
      15 24 39 56 78  81 99 
      eof
      

      Each line contains seven numbers separated by spaces; each line is a unique sequence of numbers ordered by
      This is my code so far:

      #include <QCoreApplication>
      #include <QByteArray>
      #include <QDebug>
      #include <QElapsedTimer>
      #include <QFile>
      #include <QVector>

      // Parse one line of space-separated integers; non-numeric tokens are skipped.
      QVector<int> getVectors(const QByteArray &s)
      {
          QVector<int> intList;
          intList.reserve(7);
          // Split once per line, not once per number, and do not index past the end.
          const QByteArrayList tokens = s.split(' ');
          for (const QByteArray &token : tokens)
          {
              bool isNum = false;
              const int n = token.toInt(&isNum);
              if (isNum)
                  intList << n;
          }
          return intList;
      }

      // True if vec contains no1.
      bool contNo(const QVector<int> &vec, int no1)
      {
          for (int i = 0; i < vec.size(); ++i)
              if (vec.at(i) == no1)
                  return true;
          return false;   // the original had no return on this path
      }

      int main(int argc, char *argv[])
      {
          QCoreApplication a(argc, argv);
          QElapsedTimer myTimer;
          myTimer.start();

          QFile file("300mb-15mlines.txt");
          if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
          {
              qInfo() << "Can't open file";
              return 1;   // nothing to filter without the file
          }

          QByteArray filedata = file.readAll();
          QByteArrayList balist = filedata.split('\n');
          QVector<QVector<int>> allLines;
          for (int j = 0; j < balist.size(); j++)
          {
              const QVector<int> comb = getVectors(balist.at(j));
              if (contNo(comb, 24) && contNo(comb, 78))
                  allLines << comb;
          }

          qInfo() << allLines;
          file.close();
          const qint64 nMilliseconds = myTimer.elapsed();
          qInfo() << "Time: " << nMilliseconds / 1000;

          qInfo() << "END";

          return a.exec();
      }
      
      

      This code works, but it is very slow and takes 1 GB of RAM.
      How can I improve the program? I plan to add a lot of filters, and the ultimate goal is to get a single list with all the filters applied.

        mrjj Lifetime Qt Champion last edited by mrjj

        Hi and welcome to the forums.
        You read the whole file at once:
        filedata=file.readAll();
        Does the file have line endings, so that you can read it one line at a time?
        That would reduce the memory use a lot.

        Also look into
        https://doc.qt.io/qt-5/qstringref.html
        for the splitting. It points into the original source, so nothing is copied, which makes it
        lighter on memory and MUCH faster.
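
To illustrate the "nothing is copied" idea, here is a minimal sketch in standard C++17. It is an assumption-laden stand-in, not Qt code: std::string_view and std::from_chars play the role of a non-owning view, and the name parseLine is hypothetical. The thread's data is a QByteArray, so a Qt version would look different.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: parse space-separated integers from `line`
// without allocating a substring per token.
std::vector<int> parseLine(std::string_view line)
{
    std::vector<int> numbers;
    numbers.reserve(7);                    // the thread's lines hold seven numbers
    const char *p = line.data();
    const char *end = p + line.size();
    while (p < end) {
        if (*p == ' ') { ++p; continue; }  // skip separators, including trailing spaces
        int value = 0;
        const auto result = std::from_chars(p, end, value);
        if (result.ec != std::errc{})      // not a number (e.g. "eof"): stop parsing this line
            break;
        numbers.push_back(value);
        p = result.ptr;                    // continue right after the parsed digits
    }
    return numbers;
}
```

The only allocation per line is the result vector itself; the tokens are read in place from the original buffer.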

        Also

        bool contNo(QVector<int> vec, int no1)
        {
            for (int i =0; i< vec.size(); ++i)
                if(vec.at(i)==no1)
                {
                    return true;
                }
            return false;
        }
        

        Here you pass the vector as a copy! If the function only searches it and does not change it, then use
        bool contNo(const QVector<int> &vec, int no1)

        so it's a reference and not a copy per call.

        Same with QVector<int> getVectors (QByteArray s):
        here again you pass a copy, and it could be faster with const QByteArray &s.

          JonB @Rickz last edited by JonB

          @Rickz
          All good points from @mrjj above, which you should read/act on.

          This code works but very slow and takes 1Gb RAM.

          For the RAM consumption you have a potential choice to make:

          • The "easy" way, which is what you have now, is to read all the lines into memory, and apply your "filters" one after the other. That's fine, but will require memory proportional to the whole file size.

          • Do you have to apply filter #1 to the whole file before you can apply filter #2 to the remaining lines? Or can you read line #1, apply filter #1, then apply filter #2 if line #1 passed filter #1, and so on? (This depends on how your filtering works: is each line filtered independently of all other lines?) If so, and you are prepared to implement it this way, your total RAM will be proportional to the size of one line instead of all the data!

          In terms of speed, it depends on how your filters work, but speed may be approximately equal in both cases.
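
The second option can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a standard std::istream instead of QFile so that it stands alone (in Qt, QIODevice::readLine() would be the natural counterpart), and the names passesFilters and countMatches are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-line filter: true if the line contains both 24 and 78.
bool passesFilters(const std::vector<int> &numbers)
{
    bool has24 = false, has78 = false;
    for (int n : numbers) {
        if (n == 24) has24 = true;
        if (n == 78) has78 = true;
    }
    return has24 && has78;
}

// Reads `in` one line at a time; only the current line is held in memory.
// Returns how many lines passed the filter.
std::size_t countMatches(std::istream &in)
{
    std::size_t matches = 0;
    std::string line;
    std::vector<int> numbers;
    while (std::getline(in, line)) {
        numbers.clear();                 // reuse the same buffer for every line
        std::istringstream tokens(line);
        int n;
        while (tokens >> n)              // stops at the first non-numeric token
            numbers.push_back(n);
        if (passesFilters(numbers))
            ++matches;
    }
    return matches;
}
```

Passing a std::ifstream opened on the data file to countMatches would process the whole file while keeping a single line in memory at any moment.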

            mrjj Lifetime Qt Champion @JonB last edited by

            @JonB
            Good points about how the filtering works.

            I also noted that
            if(contNo(comb,24) && contNo(comb,78) )

            loops through the list once for each number, so
            if more contNo checks are added, it might be more efficient to
            check multiple numbers in one loop, like

            ( contNo(comb,24,78,116,155) )
            so we don't loop from the beginning per number.
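
One way to check several numbers in a single loop is sketched below. It assumes that both the line and the list of wanted numbers are sorted in ascending order (the sample lines in the first post look sorted, but that is an assumption), and the name containsAll is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: true if every value in `wanted` occurs in `line`.
// Both vectors are assumed to be sorted in ascending order, which lets
// one forward pass over `line` answer the question for all wanted values.
bool containsAll(const std::vector<int> &line, const std::vector<int> &wanted)
{
    std::size_t w = 0;                       // index of the next wanted value
    for (std::size_t i = 0; i < line.size() && w < wanted.size(); ++i) {
        if (line[i] == wanted[w])
            ++w;                             // found it, look for the next one
        else if (line[i] > wanted[w])
            return false;                    // passed the spot where it would be
    }
    return w == wanted.size();
}
```

For sorted ranges, std::includes from <algorithm> answers the same question, so the hand-written loop is mainly here to show the idea.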

              Rickz @mrjj last edited by

              @mrjj said in The most efficient way to filter a huge amount of data. Mentor needed!:

              You read all of the file
              filedata=file.readAll();
              I was wondering if they have line endings so you can read it a line at a time ?

              Each line ends with a space. I'm not sure how to detect the end of a line directly, so I create a byte array from the whole file and then split it into a QByteArrayList:

              QByteArrayList balist = filedata.split('\n');
              

              BTW I use QVector just to convert each line into an int array so I can apply math filters, for example the sum of the numbers in a line.

              Here you pass the vector as a copy! If the function only searches it and does not change it, then use
              bool contNo(const QVector<int> &vec, int no1)
              so it's a reference and not a copy per call.

              Ok thank you, I understand this.

              @JonB said in The most efficient way to filter a huge amount of data. Mentor needed!:

              Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines?

              Well, the file (the whole list) is the base. Applying filter #1 gives a new, filtered list; filter #2 is then applied to that list, already filtered by filter #1, and so on.

              For example:

              Basic list:
              10 20 30 40 50 60 70 
              20 30 40 50 60 70 80 
              30 40 50 60 70 80 90 
              40 50 60 70 80 90 100 
              

              Filter #1 has to select only lines that contain the numbers 20 and 30.

              new list
              10 20 30 40 50 60 70 
              20 30 40 50 60 70 80 
              

              Now filter #2 has to select only lines that contain a number greater than 70.
              That will be:

              newer list
              20 30 40 50 60 70 80 
              

              and so on and so on...
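
The example above can be sketched as a chain of per-line predicates. This is only an illustration in standard C++: the Filter type and the function names are hypothetical, and std::vector stands in for QVector.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

using Line = std::vector<int>;
using Filter = std::function<bool(const Line &)>;   // hypothetical filter type

// Filter #1 from the example: the line contains both 20 and 30.
bool has20And30(const Line &line)
{
    return std::find(line.begin(), line.end(), 20) != line.end()
        && std::find(line.begin(), line.end(), 30) != line.end();
}

// Filter #2 from the example: the line contains a number greater than 70.
bool hasNumberAbove70(const Line &line)
{
    return std::any_of(line.begin(), line.end(), [](int n) { return n > 70; });
}

// Keeps the lines that pass every filter. Each line is tested on its own,
// so the result is the same as filtering the whole list once per filter.
std::vector<Line> applyFilters(const std::vector<Line> &lines,
                               const std::vector<Filter> &filters)
{
    std::vector<Line> kept;
    for (const Line &line : lines) {
        const bool ok = std::all_of(filters.begin(), filters.end(),
                                    [&line](const Filter &f) { return f(line); });
        if (ok)
            kept.push_back(line);
    }
    return kept;
}
```

Because every filter looks at one line only, the order of the filters does not change which lines survive; it can only change how quickly a line is rejected.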

                mrjj Lifetime Qt Champion last edited by mrjj

                Hi
                If you can do
                QByteArrayList balist = filedata.split('\n');
                and each entry in balist is a line, then the file has line endings and
                you can read it line by line if you want to reduce memory use.

                Also, if all lines have the same number of values, then calling reserve on the QVector
                https://doc.qt.io/qt-5/qvector.html#reserve
                may also help, so it won't have to grow per value.
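
A small sketch of what reserving buys, using std::vector::reserve as a stand-in for QVector::reserve (an assumption made so the example compiles without Qt). The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check: after reserving room for `count` values up front,
// filling the vector with `count` values does not change its capacity,
// i.e. no reallocation happens while it is being filled.
bool fillsWithoutReallocating(std::size_t count)
{
    std::vector<int> values;
    values.reserve(count);                           // one allocation up front
    const std::size_t capacityBefore = values.capacity();
    for (std::size_t i = 0; i < count; ++i)
        values.push_back(static_cast<int>(i));
    return values.capacity() == capacityBefore;
}
```

With only seven values per line the saving per vector is small, so this matters mainly because it is repeated for millions of lines.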

                Do you really have 15 million lines? :)
                That is some text file :)

                  JonB @Rickz last edited by JonB

                  @Rickz
                  Since your example shows that your filters are simply sequential and do not need to look at other lines, you can, if you wish, read one line at a time and pass it through all the filters. Reading all the lines into memory/an array at the start does not gain you anything. RAM use would then drop from (300 MB * whatever overhead each line carries) to about 100 bytes for one line! Up to you.
