Parsing large/big text files (quickly)

m0ng00se

Hi all,

I have a large text file I need to parse (around 300 MB, about 8.7 million lines). Every line has the following format:

"1    689 0 0 02WA Cawleys South"

I need to extract tokens 0, 1, 4 and 5, where token 5 is the rest of the line. In other words:

index 0 = 1
index 1 = 689
index 4 = 02WA
index 5 = Cawleys South

Note that the last one could be two words, three words, or even five; I don't know in advance.

At the moment, I read each line in a while loop and simplify it to remove the excess spaces; then it's easy to split it and get the first three values I need.

    if (file.open(QIODevice::ReadOnly | QIODevice::Text))
    {
        QTextStream in(&file);
        QStringList tokens;
        QString line;
        while (in.readLineInto(&line)) {
            line = line.simplified();  // simplified() returns a copy, so assign it back
            tokens = line.split(QRegExp("\\s+"));
            // tokens[0] will be 1
            // tokens[1] will be 689
            // tokens[4] will be 02WA
            ...

Then I use the section() function to get that last "bit" I need:

               line.section(" ", 5); //this will be Cawleys South

I then write the data I extracted into another file in csv format.

However, it takes ages! Probably around 2-3 minutes to read through the file. With Java or Python, it takes literally seconds.

What would be the best way to speed up this process?

Thanks all ...

Christian Ehrlicher

I would not convert it to a QString at all. The RegExp (especially creating it inside the loop) is not really fast, and ditto the QString creations; take a look at splitRef().
But it works without any of that:

while (file.canReadLine()) {
  const QByteArray line = file.readLine().simplified();
  const QList<QByteArray> tokens = line.split(' ');
  ...
}

or maybe even iterate over the line with QByteArray::indexOf(' ', oldIdx)

// edit: fixed the split to split by space instead of \n, thx @mrjj
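
The index-based scan suggested above could look roughly like the following. This is only a sketch using the C++17 standard library as a stand-in: `std::string_view::find_first_of` and `find_first_not_of` play the role of `QByteArray::indexOf(' ', oldIdx)`, and `Record` and `parseLine` are hypothetical names invented for the example. It assumes fields are separated by spaces or tabs and that a valid line has at least six tokens.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical record for this sketch: the four fields asked for in the question.
struct Record {
    std::string_view id;    // token 0
    std::string_view code;  // token 1
    std::string_view tag;   // token 4
    std::string_view name;  // token 5 up to the end of the line
};

// Scans one line by index, without copying or allocating.
// Returns false when the line has fewer than six tokens.
bool parseLine(std::string_view line, Record& out) {
    constexpr std::string_view ws = " \t\r\n";
    std::array<std::string_view, 5> tok;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < tok.size(); ++i) {
        pos = line.find_first_not_of(ws, pos);  // skip the padding before the token
        if (pos == std::string_view::npos) return false;
        const std::size_t end = line.find_first_of(ws, pos);
        if (end == std::string_view::npos) return false;  // nothing left for the name
        tok[i] = line.substr(pos, end - pos);
        pos = end;
    }
    pos = line.find_first_not_of(ws, pos);  // start of the trailing name
    if (pos == std::string_view::npos) return false;
    const std::size_t last = line.find_last_not_of(ws);  // drop the trailing newline
    out = {tok[0], tok[1], tok[4], line.substr(pos, last - pos + 1)};
    return true;
}
```

Called on the sample line from the question, `parseLine` fills `id`, `code`, `tag` and `name` with views into the original buffer, so they are only valid while that buffer is alive. Unlike simplified(), this sketch leaves any repeated spaces inside the name untouched.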

m0ng00se

Thank you, I will test that and report back.

mrjj

Hi
Do I read it wrong or should
line.split('\n');
not be
line.split(' ');
?

fcarney

@mrjj said in Parsing large/big text files (quickly):

Hi
Do I read it wrong or should
line.split('\n');
not be
line.split(' ');
?

It should be a space. That regex made no sense as a split separator: simplified() returns the line with each run of whitespace replaced by a single space. The only change I can think of is if there is a "space" character defined in Qt that should be used instead of ' '. What does simplified() use internally?
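
As far as the Qt documentation describes it, QByteArray::simplified() treats as whitespace the characters for which isspace() is true in the C locale ('\t', '\n', '\v', '\f', '\r' and ' '), trims both ends, and replaces each internal run with a single ' '. The sketch below mimics that documented behaviour with the standard library only; it is not Qt's implementation, and `isAsciiSpace` and `simplify` are hypothetical names.

```cpp
#include <string>
#include <string_view>

// The six ASCII characters that isspace() accepts in the C locale.
constexpr bool isAsciiSpace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r';
}

// Hypothetical helper: trims both ends and collapses every internal run of
// whitespace into a single space.
std::string simplify(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    bool pendingSpace = false;
    for (char c : in) {
        if (isAsciiSpace(c)) {
            // Remember one separator, but never for leading whitespace.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) out.push_back(' ');
            pendingSpace = false;
            out.push_back(c);
        }
    }
    return out;  // a trailing pending space is never emitted
}
```

If that reading of the documentation is right, every separator left after simplifying is a plain ' ', so splitting on ' ' is enough.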

m0ng00se

Very interesting. I have done a quick test with just the following so far.

Takes 17 seconds:

line.simplified();

Takes 14 seconds:

const QByteArray line = file.readLine().simplified();

Takes just over 2 minutes (130 seconds):

line.simplified();
tokens = line.split(QRegExp("\\s+"));

But this only takes 20 seconds, compared to 130 seconds for the QString equivalent:

const QByteArray line = file.readLine().simplified();
const QList<QByteArray> tokens = line.split(' ');

I think I am on the right track, thanks!

Christian Ehrlicher

@m0ng00se said in Parsing large/big text files (quickly):

tokens = line.split(QRegExp("\\s+"));

This is what I would expect, and what I said above: you're creating a big regexp object for every line. Moving it out of the loop will speed things up too. But, as already said, using a regexp here is unnecessary.
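
For illustration, hoisting the pattern out of the loop might look like this. The sketch uses `std::regex` in place of QRegExp (an assumption made so it builds without Qt), and `splitLines` is a hypothetical helper; the same idea applies to a QRegExp or QRegularExpression object created once before the loop.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits every line on runs of whitespace.
std::vector<std::vector<std::string>> splitLines(const std::vector<std::string>& lines) {
    // Compiled once, before the loop. Constructing it inside the loop would
    // repeat the pattern compilation for every single line.
    const std::regex whitespace(R"(\s+)");

    std::vector<std::vector<std::string>> rows;
    rows.reserve(lines.size());
    for (const std::string& line : lines) {
        std::vector<std::string> tokens;
        // Submatch index -1 selects the text between matches, i.e. the tokens.
        std::sregex_token_iterator it(line.begin(), line.end(), whitespace, -1);
        const std::sregex_token_iterator end;
        for (; it != end; ++it) tokens.push_back(it->str());
        rows.push_back(std::move(tokens));
    }
    return rows;
}
```

Even with the pattern hoisted, a regex split still does more work per line than a plain split on ' ', which is why dropping the regexp entirely remains the better option here.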

SGaist

Hi,

As an additional note, if you are that keen on using regular expressions, use QRegularExpression; QRegExp is deprecated.

m0ng00se

@Christian-Ehrlicher said in Parsing large/big text files (quickly):

or maybe even iterate over the line with QByteArray::indexOf(' ', oldIdx)

// edit: fixed the split to split by space instead of \n, thx @mrjj

Thank you, I am now using QByteArray::indexOf, and what used to take about 2 minutes now takes about 17 seconds. Thanks again!