
The most efficient way to filter a huge amount of data. Mentor needed!



  • Hello community,
    First of all, I wish you all the best. I'm a beginner, so please forgive any naive questions. :)
    I want to create a program that will be used to filter certain data.
    The data is currently in a text file of about 15 million lines, roughly 300 MB in size.
    I want to define criteria by which to filter that data, regardless of the order in which the filters are applied.
    The goal is to get one list with all the filters applied.

    Txt file looks like this:

    11 13 24 25 56 78 
    1 12 13 41 45 69 87 
    //more lines
    29 33 37 41 45 55 69
    15 24 39 56 78  81 99 
    eof
    

    One line contains seven numbers separated by spaces; each line is a unique sequence of numbers in ascending order.
    This is my code so far:

    QVector<int> getVectors (QByteArray s)
    {
        QVector<int> gIntList;
        const QByteArrayList tokens = s.split(' ');   // split once, not once per field
        for (const QByteArray &token : tokens)
        {
            bool isNum = false;
            int n = token.toInt(&isNum);
            if (isNum)                                // skips empty tokens and non-numbers like "eof"
                gIntList << n;
        }
        return gIntList;
    }
    
    bool contNo(QVector<int> vec, int no1)
    {
        for (int i = 0; i < vec.size(); ++i)
            if (vec.at(i) == no1)
            {
                return true;
            }
        return false;   // not found
    }
    int main(int argc, char *argv[])
    {
        QCoreApplication a(argc, argv);
        QTime myTimer;
        myTimer.start();
    
        QFile file("300mb-15mlines.txt");
        if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        {
            qInfo() << "Can't open file";
            return 1;   // no point continuing without the file
        }
    
        QByteArray filedata=file.readAll();
        QByteArrayList balist = filedata.split('\n');
        QVector<QVector<int>> allLines;
        for (int j = 0; j<balist.size(); j++)
        {
            QVector<int> comb;
            comb = getVectors(balist.at(j));
            if(contNo(comb,24) && contNo(comb,78) ){
                allLines<<comb;
            }
      }
    
        qInfo()<< allLines;
        file.close();
        int nMilliseconds = myTimer.elapsed();
        qInfo()<< "Time: "<< nMilliseconds / 1000;
    
        qInfo()<<"END";
    
        return a.exec();
    }
    
    

    This code works but it is very slow and uses about 1 GB of RAM.
    How can I improve the program? I plan to add many more filters; the ultimate goal is to get a single list with all of them applied.


  • Lifetime Qt Champion

    Hi and welcome to the forums
    You read all of the file:
    filedata = file.readAll();
    I was wondering if it has line endings so you can read it a line at a time?
    That would reduce the memory use a lot.

    Also look into
    https://doc.qt.io/qt-5/qstringref.html
    to do the splitting, as it points into the original source so nothing is copied, and hence
    it is lighter on memory and MUCH faster.

    Also

    bool contNo(QVector<int> vec, int no1)
    {
        for (int i = 0; i < vec.size(); ++i)
            if (vec.at(i) == no1)
            {
                return true;
            }
        return false;   // not found
    }
    
    

    Here you send the list as a copy! If it's purely to search it and not change it, then do
    bool contNo(QVector<int> &vec, int no1)

    so it's a reference and not a copy per call.

    Same with QVector<int> getVectors (QByteArray s)
    here again you send a copy, and it could be faster with &.
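    The two tips above (read line by line, split without copying) can be sketched in standard C++ as well; `std::string_view` plays the same no-copy role mrjj describes for `QStringRef`, slicing the line buffer without allocating substrings. This is a sketch assuming C++17; the function name is mine, not from the thread.

    ```cpp
    #include <charconv>
    #include <string_view>
    #include <vector>

    // Split one line into ints without copying substrings:
    // each std::string_view token is a slice of the original line buffer,
    // the same no-copy idea as QStringRef pointing into the source string.
    std::vector<int> parseLine(std::string_view line)
    {
        std::vector<int> values;
        values.reserve(7);                          // each line holds seven numbers
        while (!line.empty()) {
            const std::size_t start = line.find_first_not_of(' ');
            if (start == std::string_view::npos)    // only spaces left
                break;
            line.remove_prefix(start);
            const std::size_t end = line.find(' ');
            const std::string_view token = line.substr(0, end);
            int n = 0;
            const auto res = std::from_chars(token.data(), token.data() + token.size(), n);
            if (res.ec == std::errc())              // skip non-numeric tokens such as "eof"
                values.push_back(n);
            if (end == std::string_view::npos)
                break;
            line.remove_prefix(end);
        }
        return values;
    }
    ```

    Called once per line from a read loop (QTextStream::readLine or std::getline), this keeps only one line's worth of substrings alive at a time.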



  • @Rickz
    All good points from @mrjj above, which you should read/act on.

    This code works but it is very slow and uses about 1 GB of RAM.

    For the RAM consumption you have a potential choice to make:

    • The "easy" way, which is what you have now, is to read all the lines into memory, and apply your "filters" one after the other. That's fine, but will require memory proportional to the whole file size.

    • Do you have to apply filter #1 to the whole file before you can proceed to apply filter #2 to the remaining lines? Or can you read line #1, apply filter #1, then apply filter #2 if line #1 passed filter #1, and so on? (This depends on how your filtering works: does it treat each line independently of the other lines?) If this is the case, and you are prepared to implement it this way, your total RAM will be proportional to just one line instead of all the data!

    In terms of speed, it does depend on how your filters work, but it may be that the speed is approximately equal in both cases.
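    The second option above (per-line filtering) can be sketched as follows, assuming the filters only ever look at one line at a time; `passesFilter1` and `passesFilter2` are hypothetical stand-ins for the real criteria:

    ```cpp
    #include <istream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical per-line predicates; stand-ins for the real filters.
    bool passesFilter1(const std::vector<int>& line) { return line.size() == 7; }
    bool passesFilter2(const std::vector<int>& line) { return !line.empty() && line.front() < 50; }

    // Read a line, run it through every filter immediately, and only keep
    // survivors: memory use stays at roughly one line, not the whole file.
    std::vector<std::vector<int>> filterPerLine(std::istream& in)
    {
        std::vector<std::vector<int>> survivors;
        std::string text;
        while (std::getline(in, text)) {
            std::vector<int> line;
            std::istringstream parse(text);
            for (int n; parse >> n; )
                line.push_back(n);
            if (passesFilter1(line) && passesFilter2(line))  // short-circuits on the first failure
                survivors.push_back(line);
        }
        return survivors;
    }
    ```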


  • Lifetime Qt Champion

    @JonB
    Good points about how the filtering works.

    I also noted that
    if(contNo(comb,24) && contNo(comb,78) )

    will loop through the entire list for both numbers, so
    if more contNo checks are to be added, it would be more effective to
    check multiple numbers in one loop, like

    contNo(comb, 24, 78, 116, 155)
    so we don't loop from the beginning per number.
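    One way to sketch this single-pass check in standard C++ (the function name and the `initializer_list` design are mine, not from the thread):

    ```cpp
    #include <cstddef>
    #include <initializer_list>
    #include <vector>

    // One pass over the line checks every target number, instead of one
    // full scan per number as with repeated contNo() calls. Assumes the
    // numbers within a line are distinct, as in the sample data.
    bool containsAll(const std::vector<int>& vec, std::initializer_list<int> targets)
    {
        std::size_t found = 0;
        for (int v : vec)                 // single loop over the line
            for (int t : targets)         // tiny inner loop over the few targets
                if (v == t) {
                    ++found;
                    break;
                }
        return found == targets.size();
    }
    ```

    Usage then mirrors mrjj's sketch: `if (containsAll(comb, {24, 78, 116, 155})) ...`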



  • @mrjj said in The most efficient way to filter a huge amount of data. Mentor needed!:

    You read all of the file:
    filedata = file.readAll();
    I was wondering if it has line endings so you can read it a line at a time?

    Lines end with a space; I'm not sure how to detect the end of a line directly, so I create a byte array from the whole file and then split it into a byte array list:

    QByteArrayList balist = filedata.split('\n');
    

    BTW I use QVector just to convert each line into an int array so I can apply math filters, for example the sum of the numbers in a line.

    Here you send the list as a copy! If it's purely to search it and not change it, then do
    bool contNo(QVector<int> &vec, int no1)
    so it's a reference and not a copy per call.

    Ok thank you, I understand this.

    @JonB said in The most efficient way to filter a huge amount of data. Mentor needed!:

    Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines?

    Well, the file (the whole list) is the base. After applying filter #1 we have a new, filtered list; filter #2 is then applied to that new list, already filtered by filter #1, and so on.

    For example:

    Basic list:
    10 20 30 40 50 60 70 
    20 30 40 50 60 70 80 
    30 40 50 60 70 80 90 
    40 50 60 70 80 90 100 
    

    Filter#1 have to select only lines that contains number 20 and 30

    new list
    10 20 30 40 50 60 70 
    20 30 40 50 60 70 80 
    

    Now filter #2 has to select only the lines that contain a number greater than 70.
    that will be:

    newer list
    20 30 40 50 60 70 80 
    

    and so on and so on...
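    The two example filters above can be written as independent per-line predicates and chained; this is a sketch in standard C++ with names of my own choosing:

    ```cpp
    #include <algorithm>
    #include <vector>

    using Line = std::vector<int>;

    // Filter #1: keep lines containing both 20 and 30.
    bool hasBoth20And30(const Line& l)
    {
        return std::find(l.begin(), l.end(), 20) != l.end()
            && std::find(l.begin(), l.end(), 30) != l.end();
    }

    // Filter #2: keep lines containing at least one number greater than 70.
    bool hasNumberAbove70(const Line& l)
    {
        return std::any_of(l.begin(), l.end(), [](int n) { return n > 70; });
    }

    // Applying filter #2 to the output of filter #1 is the same as requiring
    // a line to pass both, so the filters can be chained per line.
    std::vector<Line> applyFilters(const std::vector<Line>& lines)
    {
        std::vector<Line> out;
        for (const Line& l : lines)
            if (hasBoth20And30(l) && hasNumberAbove70(l))
                out.push_back(l);
        return out;
    }
    ```

    Run on the four example lines, only `20 30 40 50 60 70 80` survives both filters, matching the walkthrough above.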


  • Lifetime Qt Champion

    Hi
    If you can do
    QByteArrayList balist = filedata.split('\n');
    and each entry in balist is a line, then it means the file has line endings and
    you can read it line by line if you want to reduce memory use.

    Also, if all lines have the same number of values, then calling reserve() on the QVector
    https://doc.qt.io/qt-5/qvector.html#reserve
    can also help, so it won't have to expand per value.

    You really have 15 million lines ? :)
    That is some text file :)
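    The reserve() tip can be illustrated with std::vector, which QVector::reserve mirrors: one allocation up front instead of repeated growth during the seven appends. The function and its values are a made-up example.

    ```cpp
    #include <vector>

    // Preallocate for the known line width so push_back never reallocates.
    std::vector<int> makeLineVector()
    {
        std::vector<int> values;
        values.reserve(7);              // one allocation for the seven numbers per line
        for (int i = 1; i <= 7; ++i)
            values.push_back(i * 10);   // placeholder values; real code parses them from the line
        return values;
    }
    ```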



  • @Rickz
    So, since from your example your filters are simple, sequential, and do not need to look at other lines, you can if you wish change the code to simply read one line at a time and pass it through all the filters. Reading all the lines into memory/an array at the start does not gain you anything. In that case RAM use will drop from (300 MB plus per-line overhead) to about 100 bytes for one line! Up to you.
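    Putting the thread's advice together, here is a sketch of that streaming version with a generic chain of filters. The names and the `std::function`-based design are mine, not from the thread; a Qt version would use QTextStream and QVector the same way.

    ```cpp
    #include <functional>
    #include <istream>
    #include <sstream>
    #include <string>
    #include <vector>

    using Line = std::vector<int>;
    using Filter = std::function<bool(const Line&)>;

    // Parse with istringstream for brevity; the QStringRef/string_view
    // approach discussed earlier in the thread is faster but longer.
    Line parseNumbers(const std::string& s)
    {
        Line values;
        values.reserve(7);
        std::istringstream in(s);
        for (int n; in >> n; )
            values.push_back(n);
        return values;
    }

    // One line in memory at a time; every filter is applied to the line
    // before the next one is read, short-circuiting on the first failure.
    std::vector<Line> filterStream(std::istream& in, const std::vector<Filter>& filters)
    {
        std::vector<Line> kept;
        std::string text;
        while (std::getline(in, text)) {
            Line comb = parseNumbers(text);
            if (comb.empty())
                continue;                 // blank line or the trailing "eof" marker
            bool pass = true;
            for (const Filter& f : filters) {
                if (!f(comb)) {
                    pass = false;
                    break;
                }
            }
            if (pass)
                kept.push_back(std::move(comb));
        }
        return kept;
    }
    ```

    New filters are then just extra entries in the `filters` vector, applied in whatever order they are listed, without re-reading the file.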

