[Solved] Search for file entries in another file
-
Hello all!
I am running into performance issues with an, at first sight, simple task. I have to search one file, find all relevant IDs, and then look up the additional content for each ID in another file.
How would you solve that efficiently using Qt? As I have very large files with several 100,000 entries, the obvious approach takes ages...
while (file1.readLineInto(&file1Line)) {
    explode = file1Line.split(";");
    QString ID = explode[1];
    file2.seek(1); //Ignore the headline
    bool found = false;
    while (file2.readLineInto(&file2Line) && !found) {
        file2explode = file2Line.split(";");
        if (file2explode[3] == ID) {
            qDebug() << "Found first entry!";
            found = true;
        }
    }
}
I tried storing the irrelevant IDs in a QStringList and checking whether an ID is relevant before searching the second file, but that check consumed all the saved time, so it didn't really help.
Would it be any good to store the data in a temporary SQLite database to do the search and checks there? Or should I optimize the files and sort them to be able to write a more efficient search?
Maybe there is something built into Qt that I am missing that could speed things up?
How would you solve that problem?
Many thanks in advance!
-
@Heinz said in Search for file-entries in another file:
Would it be any good to store the data in a temporary SQLite database to do the search and checks there? Or should I optimize the files and sort them to be able to write a more efficient search?
Yes, at least one of these. If you have several hundred thousand entries and it is too slow, you should: at the moment every lookup has to examine each and every entry in turn.
At the very least, if you search more than once (do you?), you should read the entries into memory and set up, say, a QMap of them if you do not want to use a database or otherwise sort/organise the inputs. -
Hi
But how large are those files? In MB? Also be aware of
https://doc.qt.io/qt-5/qstringref.html
which allows splitting using refs, i.e. no copies, which can provide a speedup in some use cases. Like if you load the whole file into memory and split it.
You could also try something like this
https://github.com/ben-strasser/fast-cpp-csv-parser
and see how much faster it really is :)
But you still need to optimize the look-up. -
Hey guys,
thank you very much for your advice and support!
My example file1 has 400 entries and is 66 kB; example file2 has 50,000 entries and is 16.5 MB. So basically I was searching the 50k entries 400 times, stopping each iteration once I found the right entry. That took 156 seconds to finish; around 155 of those seconds resulted from searching the 50k file again and again.
As my final files will be way larger, I tried one of the recommended approaches from above. Now all 50k entries are loaded into a temporary SQLite DB and an index is created, which takes around 80 seconds in total. At first I was worried, because at that point I hadn't even started iterating and searching yet...
But hey, the searching doesn't even take 1 second now. I still can't believe the performance gain, but I can't find an error either... so it might as well be true!
Not afraid of large data anymore!! Thanks again!
-
file2.seek(1); //Ignore the headline
The comment implies that you think this seeks to the second line in the stream, when it actually seeks to the second byte/character. So the first line you read from file2 is the remainder of the file's first (header) line. If you are still doing this, you might have one more entry in your SQLite database than you expected.
-
@Heinz
Trying to understand the figures you quote: are you saying all 50k entries * 400 searches == 156 seconds, or that one search took that long and the remaining searches took a further 399 * 155 seconds?
If you are only talking about reducing a total of 156 seconds down to 80 seconds, I am surprised it isn't better than that.
I understand you have 50k+ entries (file #2) to put into a table to search, and 400 entries (file #1) to look up in that. Is the smaller file itself static, or is it dynamic/changing? If it is static, it might be interesting to put the smaller file into a SQLite table too, like the main entries. Then, depending on what you want to achieve, you might find that some kind of SQLite JOIN statement could be used to "line up" all the matches in a single call and get the result, instead of looking for each entry separately. I don't know how efficient SQLite might be at that.
While I'm here, an alternative without using a database might be: I'd read the smaller file's 400 entries into a QMap. Then I'd read the larger 50k entries sequentially once (which has to be done to populate a SQLite database anyway), looking up each entry in the map as I go. I would have thought that would be about the fastest, while using little memory.