Parsing large/big text files (quickly)
-
Hi all,
I have a large text file I need to parse (around 300MB, 8.7 million lines). Every line has the following format:
"1 689 0 0 02WA Cawleys South"
I need to extract the following positions/tokens: 0, 1, 4 and 5 (5 needs to be the rest of the line), so in other words:
index 0 = 1
index 1 = 689
index 4 = 02WA
index 5 = Cawleys South
Just a note: that last one could be two words, three words, or even five words; I don't always know.
At the moment, I read each line (in a while loop), simplify it to remove the excess spaces, and then it's easy to split and get the first 3 values I need.
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QTextStream in(&file);
    QStringList tokens;
    QString line;
    while (in.readLineInto(&line)) {
        line.simplified();
        tokens = line.split(QRegExp("\\s+"));
        tokens[0] // this will be 1
        tokens[1] // this will be 689
        tokens[4] // this will be 02WA
        ...
Then I use the section() function to get that last "bit" I need:
line.section(" ", 5); //this will be Cawleys South
I then write the data I extracted into another file in csv format.
However, it takes ages! Probably around 2-3 minutes to read through the file. With Java or Python, it takes literally seconds.
What would be the best way to speed up this process ?
Thanks all ...
-
I would not convert it to a QString at all. The RegExp (especially its creation inside the loop) is not really fast, ditto the QString creations. Take a look at splitRef().
But it works without all that:
while (file.canReadLine()) {
    const QByteArray line = file.readLine().simplified();
    const QList<QByteArray> tokens = line.split(' ');
    ...
}
or maybe even iterate over line with QByteArray::indexOf(' ', oldIdx)
// edit: fixed the split to split by space instead \n, thx @mrjj
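To make the indexOf idea a bit more concrete, here is a minimal sketch in plain standard C++, where std::string_view::find(' ', pos) plays the role of QByteArray::indexOf(' ', oldIdx). The Record struct and the names nextToken and parseLine are made up for illustration. It assumes fields are separated by spaces only (no tabs), and it skips runs of spaces itself, so no simplified() pass is needed first.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical record type: views into the original line, nothing is copied.
struct Record {
    std::string_view id;    // token 0, e.g. "1"
    std::string_view code;  // token 1, e.g. "689"
    std::string_view tag;   // token 4, e.g. "02WA"
    std::string_view name;  // token 5 up to the end of the line
};

// Returns the next space-delimited token at or after pos and moves pos
// just past it. Returns an empty view when no token is left.
inline std::string_view nextToken(std::string_view line, std::size_t &pos)
{
    pos = line.find_first_not_of(' ', pos);   // skip a run of spaces
    if (pos == std::string_view::npos) {
        pos = line.size();
        return {};
    }
    std::size_t end = line.find(' ', pos);    // counterpart of indexOf(' ', pos)
    if (end == std::string_view::npos)
        end = line.size();
    const std::string_view token = line.substr(pos, end - pos);
    pos = end;
    return token;
}

// Extracts tokens 0, 1, 4 and "5 to end of line".
// Returns std::nullopt when the line has fewer than six fields.
inline std::optional<Record> parseLine(std::string_view line)
{
    // Drop the trailing newline, carriage return and spaces.
    const std::size_t last = line.find_last_not_of(" \r\n");
    if (last == std::string_view::npos)
        return std::nullopt;
    line = line.substr(0, last + 1);

    std::size_t pos = 0;
    Record r;
    r.id = nextToken(line, pos);
    r.code = nextToken(line, pos);
    nextToken(line, pos);                     // token 2, not needed
    nextToken(line, pos);                     // token 3, not needed
    r.tag = nextToken(line, pos);

    pos = line.find_first_not_of(' ', pos);   // start of the free-text name
    if (pos == std::string_view::npos)
        return std::nullopt;
    r.name = line.substr(pos);
    return r;
}
```

Unlike simplified(), this leaves any repeated spaces inside the name untouched, which may or may not matter for the CSV output.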
-
Hi
Do I read it wrong, or should
line.split('\n');
not be
line.split(' ');
?
-
@mrjj said in Parsing large/big text files (quickly):
Hi
Do I read it wrong, or should
line.split('\n');
not be
line.split(' ');
?
It should be a space. That regex made absolutely no sense as a split pattern: simplified() returns the line with a single space in place of any run of whitespace. The only change I can think of is if there is a "space" character defined in Qt that should be used instead of ' '. What does simplified() use internally?
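On that last question: as far as I recall, the Qt documentation for QByteArray::simplified() describes whitespace as the characters for which isspace() is true in the C locale (space, \t, \n, \v, \f, \r), with leading and trailing whitespace removed and each internal run replaced by one ' '. So splitting on a plain ' ' afterwards should be fine. Below is a standard C++ sketch of that documented behaviour. It is not Qt's actual implementation, and the names isAsciiSpace and simplifyWhitespace are made up.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// True for the six ASCII characters that isspace() accepts in the C locale.
inline bool isAsciiSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Trims both ends and collapses every internal run of whitespace
// into a single space, returning the result as a new string.
inline std::string simplifyWhitespace(std::string_view in)
{
    std::string out;
    out.reserve(in.size());
    bool pendingSpace = false;
    for (const char c : in) {
        if (isAsciiSpace(c)) {
            // Remember the gap, but only once some text has been emitted,
            // so leading whitespace is dropped.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace)
                out.push_back(' ');
            pendingSpace = false;
            out.push_back(c);
        }
    }
    // A gap still pending here was trailing whitespace and is never written.
    return out;
}
```

Like this sketch, simplified() returns a new value rather than modifying the object it is called on, so a bare line.simplified(); leaves line unchanged unless the result is assigned.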
-
Very interesting. I did a quick test with just the following so far:
Takes 17 seconds:
line.simplified();
Takes 14 seconds:
const QByteArray line = file.readLine().simplified();
Takes just over 2 minutes (130 seconds):
line.simplified(); tokens = line.split(QRegExp("\\s+"));
But this only takes 20 seconds, compared to 130 seconds for the QString equivalent:
const QByteArray line = file.readLine().simplified(); const QList<QByteArray> tokens = line.split(' ');
I think I am on the right track, thanks !
-
@m0ng00se said in Parsing large/big text files (quickly):
tokens = line.split(QRegExp("\\s+"));
This is what I would expect and what I said above: you're creating a regexp object for every line. Moving it out of the loop will speed things up too. But using a regexp here is useless, as already said.
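If a regexp has to stay for some reason, the hoisting looks roughly like this. The sketch uses std::regex from the standard library purely as a stand-in for QRegExp/QRegularExpression, and countTokens is a made-up helper. The point is only that the pattern is compiled once, before the loop, instead of once per line.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts whitespace-separated tokens over all lines.
inline std::size_t countTokens(const std::vector<std::string> &lines)
{
    // Compiled once here, not once per iteration of the loop below.
    const std::regex separator("\\s+");
    const std::sregex_token_iterator end;

    std::size_t count = 0;
    for (const std::string &line : lines) {
        // -1 selects the text between matches, i.e. the tokens themselves.
        std::sregex_token_iterator it(line.begin(), line.end(), separator, -1);
        for (; it != end; ++it) {
            if (it->length() > 0)   // leading whitespace yields an empty token
                ++count;
        }
    }
    return count;
}
```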
-
Hi,
As an additional note, if you are keen on using regular expressions, use QRegularExpression; QRegExp is deprecated.
-
@Christian-Ehrlicher said in Parsing large/big text files (quickly):
or maybe even iterate over line with QByteArray::indexOf(' ', oldIdx)
// edit: fixed the split to split by space instead \n, thx @mrjj
Thank you, I am now using QByteArray::indexOf, and what used to take about 2 minutes now takes about 17 seconds. Thanks again!