The most efficient way to filter a huge amount of data. Mentor needed!
-
Hello community,
First of all, I wish you all the best. I am a beginner, so please forgive any illogical questions. :)
I want to create a program that will be used to filter certain data.
The data at this time is in a text file of 15M lines, about 300 MB in size.
I want to create certain criteria by which I would filter that data, regardless of the order in which the filters are turned on.
The goal is to get one list with all the filters applied. The txt file looks like this:
11 13 24 25 56 78
1 12 13 41 45 69 87
// more lines
29 33 37 41 45 55 69
15 24 39 56 78 81 99
eof
Each line contains seven numbers separated by spaces, and each line is a unique sequence of numbers ordered ascending.
This is my code so far:

#include <QCoreApplication>
#include <QFile>
#include <QTime>
#include <QVector>
#include <QByteArrayList>
#include <QDebug>

// Parse one line of up to seven space-separated numbers into a vector of ints.
QVector<int> getVectors(QByteArray s)
{
    QVector<int> gIntList;
    const QList<QByteArray> parts = s.split(' ');
    for (int i = 0; i < parts.size() && i < 7; i++) {
        bool isNum = false;
        int n = parts[i].toInt(&isNum);
        if (isNum)
            gIntList << n;
    }
    return gIntList;
}

// Return true if the vector contains the number no1.
bool contNo(QVector<int> vec, int no1)
{
    for (int i = 0; i < vec.size(); ++i)
        if (vec.at(i) == no1)
            return true;
    return false;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QTime myTimer;
    myTimer.start();

    QFile file("300mb-15mlines.txt");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qInfo() << "Can't open file";
        return 1;
    }

    // Read the whole file into memory and split it into lines.
    QByteArray filedata = file.readAll();
    QByteArrayList balist = filedata.split('\n');

    // Keep only the lines that contain both 24 and 78.
    QVector<QVector<int>> allLines;
    for (int j = 0; j < balist.size(); j++) {
        QVector<int> comb = getVectors(balist.at(j));
        if (contNo(comb, 24) && contNo(comb, 78))
            allLines << comb;
    }

    qInfo() << allLines;
    file.close();

    int nMilliseconds = myTimer.elapsed();
    qInfo() << "Time: " << nMilliseconds / 1000;
    qInfo() << "END";
    return a.exec();
}
This code works but is very slow and takes 1 GB of RAM.
How can I improve the program? I plan to add a lot of filters, and the ultimate goal is to get a single list with all the filters applied.

-
Hi and welcome to the forums.

You read all of the file:

filedata = file.readAll();

I was wondering if it has line endings so you can read it a line at a time? That would reduce the memory use a lot.

Also look into
https://doc.qt.io/qt-5/qstringref.html
to do the splitting, as it points into the original source so nothing is copied, and hence it is lighter on memory and MUCH faster.
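Something like this is what I mean (a rough sketch, assuming Qt 5; QString::splitRef() returns QStringRef views into the line):

#include <QFile>
#include <QString>
#include <QStringRef>
#include <QVector>

void filterLineByLine()
{
    QFile file("300mb-15mlines.txt");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return;

    while (!file.atEnd()) {
        // Only the current line is held in memory.
        const QString line = QString::fromLatin1(file.readLine());

        // splitRef() gives views into 'line'; the substrings are not copied.
        const QVector<QStringRef> parts = line.splitRef(' ', QString::SkipEmptyParts);

        QVector<int> numbers;
        for (const QStringRef &p : parts) {
            bool ok = false;
            const int n = p.toInt(&ok);
            if (ok)
                numbers << n;
        }
        // ... run your filters on 'numbers' here ...
    }
}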
Also:

bool contNo(QVector<int> vec, int no1)
{
    for (int i = 0; i < vec.size(); ++i)
        if (vec.at(i) == no1)
            return true;
    return false;
}

Here you send the list as a copy! If it's purely to search it and not change it, then do

bool contNo(QVector<int> &vec, int no1)

so it's a reference and not a copy per call.
Same with

QVector<int> getVectors(QByteArray s)

Here again you send a copy, and it could be faster with &.

-
@Rickz
All good points from @mrjj above, which you should read/act on.

@Rickz said in The most efficient way to filter a huge amount of data. Mentor needed!:

This code works but is very slow and takes 1 GB of RAM.
For the RAM consumption you have a potential choice to make:
- The "easy" way, which is what you have now, is to read all the lines into memory and apply your "filters" one after the other. That's fine, but it will require memory proportional to the whole file size.
- Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines? Or can you read line #1, apply filter #1, then apply filter #2 if line #1 passed filter #1, and so on? (This depends on how your filtering works: does it work on each line independently of any other lines?) If this is the case, and you are prepared to implement it this way, your total RAM will be proportional to just one line instead of all the data!
In terms of speed it does depend on how your filters work, but it may be that speed is approximately equal in both cases.
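If your filters can each be expressed as a test on a single parsed line, a sketch of what the per-line chain might look like (the names here are illustrative only, not from your code):

#include <QVector>
#include <functional>

using Line = QVector<int>;
using Filter = std::function<bool(const Line &)>;

// A line survives only if every filter in the chain accepts it.
bool passesAll(const Line &line, const QVector<Filter> &filters)
{
    for (const Filter &f : filters)
        if (!f(line))
            return false;
    return true;
}

Each line read from the file is tested once against the whole chain, and only the survivors ever need to be stored.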
-
@JonB
Good points about how the filtering works.

I also noted that

if (contNo(comb, 24) && contNo(comb, 78))

will loop through the entire list for both numbers, so if more contNo checks are to be added then it would/might be more effective to check multiple numbers in one loop, like

contNo(comb, 24, 78, 116, 155)

so we don't loop from the beginning per number.
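A rough sketch of such a multi-number version, using an initializer list (this exact signature is my own invention; the idea is just one pass over the line):

#include <QVector>
#include <initializer_list>

// True only if 'vec' contains every number in 'nos',
// found in a single pass over 'vec'.
bool contNo(const QVector<int> &vec, std::initializer_list<int> nos)
{
    QVector<int> remaining(nos);      // targets still to be found
    for (int v : vec) {
        if (remaining.isEmpty())
            break;                    // all targets seen: stop early
        remaining.removeOne(v);       // cross off v if it was a target
    }
    return remaining.isEmpty();
}

// Usage: one pass instead of four:
// if (contNo(comb, {24, 78, 116, 155})) ...

-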
@mrjj said in The most efficient way to filter a huge amount of data. Mentor needed!:
You read all of the file
filedata=file.readAll();
I was wondering if it has line endings so you can read it a line at a time?

Lines end with a space; I'm not sure how to detect the end of a line directly. Because of that I create a byte array from the whole file and then split it into a QByteArrayList:
QByteArrayList balist = filedata.split('\n');
BTW, I use QVector just to convert each line into an int array so I can perform math filters, for example the sum of the numbers in an array (line).
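For example, a sum filter might be something like this (just a sketch to show what I mean; the name and the threshold 300 are arbitrary):

#include <QVector>
#include <numeric>

// Keep only lines whose numbers add up to more than a threshold.
bool sumAbove(const QVector<int> &line, int threshold)
{
    return std::accumulate(line.begin(), line.end(), 0) > threshold;
}

// e.g. if (sumAbove(comb, 300)) ...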
Here you send the list as a copy! If it's purely to search it and not change it, then do

bool contNo(QVector<int> &vec, int no1)

so it's a reference and not a copy per call.

OK, thank you, I understand this.
@JonB said in The most efficient way to filter a huge amount of data. Mentor needed!:
Do you have to apply filter #1 to the whole of the file before you can proceed to apply filter #2 to the remaining lines?
Well, the file (the whole list) is the base; after applying filter #1 we have a new filtered list, then filter #2 is applied to that new list, already filtered by filter #1, and so on.
For example:
Basic list:

10 20 30 40 50 60 70
20 30 40 50 60 70 80
30 40 50 60 70 80 90
40 50 60 70 80 90 100

Filter #1 has to select only lines that contain the numbers 20 and 30.

New list:

10 20 30 40 50 60 70
20 30 40 50 60 70 80

Now filter #2 has to select only lines that contain a number greater than 70.

That will be the newer list:

20 30 40 50 60 70 80

and so on and so on...
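Written as code, those two example filters could look something like this (the names are made up, just to illustrate):

#include <QVector>
#include <algorithm>

// Filter #1: the line must contain both 20 and 30.
bool filter1(const QVector<int> &line)
{
    return line.contains(20) && line.contains(30);
}

// Filter #2: the line must contain at least one number greater than 70.
bool filter2(const QVector<int> &line)
{
    return std::any_of(line.begin(), line.end(),
                       [](int n) { return n > 70; });
}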
-
Hi
If you can do
QByteArrayList balist = filedata.split('\n');
and each entry in balist is a line, then it means it has line endings and you can read it line by line if you want to reduce memory use.

Also, if all lines have the same number of values, then maybe calling reserve on the QVector
https://doc.qt.io/qt-5/qvector.html#reserve
can also help, so it won't have to expand per value.
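For example (seven values per line, as in your file):

QVector<int> comb;
comb.reserve(7);   // one allocation up front for the seven values,
                   // instead of growing on every append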
You really have 15 million lines? :)
That is some text file :)

-
@Rickz
So, since from your example your filters are simple and sequential and do not need to look at values on other lines, if you wish you can change it so you simply read one line at a time and pass it through all the filters. Reading all the lines into memory/an array at the start does not gain you anything. In that case RAM will reduce from (300 MB * whatever for each line) to about 100 bytes for one line! Up to you.
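A sketch of that end-to-end shape, combining line-by-line reading with per-line filtering (the filter bodies are just the examples from earlier in the thread; everything else is an assumption about how you would wire it up):

#include <QFile>
#include <QVector>
#include <QDebug>
#include <algorithm>

// True if a parsed line survives every filter.
static bool passesFilters(const QVector<int> &line)
{
    // Filter #1: must contain both 20 and 30.
    if (!line.contains(20) || !line.contains(30))
        return false;
    // Filter #2: must contain at least one number greater than 70.
    return std::any_of(line.begin(), line.end(),
                       [](int n) { return n > 70; });
}

int main()
{
    QFile file("300mb-15mlines.txt");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return 1;

    QVector<QVector<int>> survivors;            // only lines that pass all filters
    while (!file.atEnd()) {
        const QByteArray raw = file.readLine(); // one line in memory at a time
        QVector<int> line;
        line.reserve(7);
        for (const QByteArray &tok : raw.split(' ')) {
            bool ok = false;
            const int n = tok.trimmed().toInt(&ok);
            if (ok)
                line.append(n);
        }
        if (passesFilters(line))
            survivors << line;
    }

    qInfo() << "lines kept:" << survivors.size();
    return 0;
}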