
Parsing large/big text files (quickly)



  • Hi all,

    I have a large text file I need to parse (around 300 MB, 8.7 million lines). Every line has the following format:

    "1    689 0 0 02WA Cawleys South"
    

    I need to extract the tokens at positions 0, 1, 4 and 5 (5 needs to be the rest of the line), in other words:

    index 0 = 1
    index 1 = 689
    index 4 = 02WA
    index 5 = Cawleys South

    Just a note: that last one could be two words, three words, or even five words; I don't know in advance.

    At the moment, I read each line (in a while loop), simplify it to remove the excess spaces, and then it's easy to split and get the first 3 values I need.

        if (file.open(QIODevice::ReadOnly | QIODevice::Text))
        {
            QTextStream in(&file);
            QStringList tokens;
            QString line;
            while (in.readLineInto(&line)) {
                line = line.simplified(); // simplified() returns a copy, so keep the result
                tokens = line.split(QRegExp("\\s+"));
                tokens[0] // this will be 1
                tokens[1] // this will be 689
                tokens[4] // this will be 02WA
                ...
    

    Then I use the section() function to get that last "bit" I need:

                   line.section(" ", 5); //this will be Cawleys South
    

    I then write the data I extracted into another file in csv format.

    However, it takes ages! Probably around 2-3 minutes to read through the file. With Java or Python, it takes literally seconds.

    What would be the best way to speed up this process?

    Thanks all ...


  • Lifetime Qt Champion

    I would not convert it to a QString at all. The RegExp (especially creating it inside the loop) is not really fast, and neither is creating all those QStrings; take a look at splitRef().
    But it works without any of that:

    while (!file.atEnd()) { // canReadLine() only inspects already-buffered data
      const QByteArray line = file.readLine().simplified();
      const QList<QByteArray> tokens = line.split(' ');
      ...
    }
    

    or maybe even iterate over line with QByteArray::indexOf(' ', oldIdx)

    // edit: fixed the split to split by space instead \n, thx @mrjj
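
    The index-based scan suggested above could look roughly like the sketch below. It is written against std::string_view so that it compiles without Qt; std::string_view::find_first_of stands in for QByteArray::indexOf, and the names Record, kSpace and parseLine are made up for illustration. Unlike simplified(), it leaves runs of spaces inside the trailing name untouched.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical record type: the four fields the first post asks for.
// The views point into the line that was parsed, so nothing is copied.
struct Record {
    std::string_view id;    // token 0, e.g. "1"
    std::string_view code;  // token 1, e.g. "689"
    std::string_view area;  // token 4, e.g. "02WA"
    std::string_view name;  // token 5 up to the end of the line
};

// Characters treated as separators (assumption: plain ASCII input).
constexpr std::string_view kSpace = " \t\r\n";

// Scans one line by hand: skip a run of separators, find the next one, repeat.
// Returns std::nullopt when the line has fewer than six tokens.
std::optional<Record> parseLine(std::string_view line) {
    std::string_view tokens[5];
    std::size_t pos = 0;
    for (auto &token : tokens) {
        pos = line.find_first_not_of(kSpace, pos);  // start of the next token
        if (pos == std::string_view::npos) return std::nullopt;
        std::size_t end = line.find_first_of(kSpace, pos);  // first separator after it
        if (end == std::string_view::npos) end = line.size();
        token = line.substr(pos, end - pos);
        pos = end;
    }
    // Everything after the fifth token is the name, minus trailing separators.
    pos = line.find_first_not_of(kSpace, pos);
    if (pos == std::string_view::npos) return std::nullopt;
    const std::size_t last = line.find_last_not_of(kSpace);
    return Record{tokens[0], tokens[1], tokens[4], line.substr(pos, last - pos + 1)};
}
```

    Called on the sample line from the first post, parseLine should yield 1, 689, 02WA and Cawleys South. Because the fields are views, they are only valid while the line buffer they came from is still alive.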



  • Thank you, I will test that and report back.


  • Lifetime Qt Champion

    Hi
    Do I read it wrong, or should
    line.split('\n');
    not be
    line.split(' ');
    ?



  • @mrjj said in Parsing large/big text files (quickly):

    Hi
    Do I read it wrong, or should
    line.split('\n');
    not be
    line.split(' ');
    ?

    It should be a space. That regex made absolutely no sense as a split character: simplified() returns the line with a single space in place of every run of whitespace. The only change I can think of is if Qt defines a "space" character that should be used instead of ' '. What does simplified() use internally?
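
    As a rough model of what the question is getting at: QByteArray::simplified() is documented to strip leading and trailing ASCII whitespace and to replace each internal run with one plain ' ' character, which is why split(' ') is safe afterwards. The sketch below imitates that documented behaviour in standard C++; it is not Qt's implementation, and the names isAsciiSpace and simplify are hypothetical.

```cpp
#include <string>
#include <string_view>

// ASCII whitespace only: space, '\t', '\n', '\v', '\f', '\r'.
inline bool isAsciiSpace(char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Hypothetical helper that models the documented effect of simplified():
// trim both ends and collapse every internal whitespace run to one ' '.
std::string simplify(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    bool pendingSpace = false;
    for (char c : in) {
        if (isAsciiSpace(c)) {
            // Remember the gap, but never emit one before the first token.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) out += ' ';
            pendingSpace = false;
            out += c;
        }
    }
    return out;  // a gap at the very end is simply never written
}
```

    Whatever mix of tabs and spaces separates the tokens on input, the output contains only single 0x20 separators, so no special Qt "space" constant is needed for the split.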



  • Very interesting. I've done a quick test with just the following so far:

    Takes 17 seconds:

    line.simplified();
    

    Takes 14 seconds:

    const QByteArray line = file.readLine().simplified();
    

    Takes just over 2 minutes (130 seconds):

    line.simplified();
    tokens = line.split(QRegExp("\\s+"));
    

    But this only takes 20 seconds, compared to 130 seconds for the QString equivalent:

    const QByteArray line = file.readLine().simplified();
    const QList<QByteArray> tokens = line.split(' ');
    

    I think I am on the right track, thanks !


  • Lifetime Qt Champion

    @m0ng00se said in Parsing large/big text files (quickly):

    tokens = line.split(QRegExp("\\s+"));

    This is what I expected and said above: you're creating a big regexp object for every line. Moving it out of the loop will speed things up too, but as already said, using a regexp here is pointless.
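
    To illustrate only the "move it out of the loop" part, here is a minimal sketch using std::regex instead of QRegExp so that it has no Qt dependency (that substitution is an assumption made for the example). In the thread's code the equivalent change would be declaring the regex object once before the while loop. The names kWhitespace and splitOnWhitespace are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Built once: constructing the object compiles the pattern, which is the
// part that is wasteful to repeat for each of several million lines.
static const std::regex kWhitespace(R"(\s+)");

// Splits a line on whitespace runs, reusing the precompiled pattern.
std::vector<std::string> splitOnWhitespace(const std::string &line) {
    std::vector<std::string> tokens;
    // Submatch index -1 selects the text between matches, i.e. the tokens.
    std::sregex_token_iterator it(line.begin(), line.end(), kWhitespace, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) {  // leading whitespace yields one empty piece
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

    Even with the pattern hoisted, every call still runs the regex engine over the line, which is why a plain split on ' ' after simplified() remains the cheaper option here.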


  • Lifetime Qt Champion

    Hi,

    As an additional note, if you really want to use regular expressions, use QRegularExpression; QRegExp is deprecated.



  • @Christian-Ehrlicher said in Parsing large/big text files (quickly):

    or maybe even iterate over line with QByteArray::indexOf(' ', oldIdx)

    // edit: fixed the split to split by space instead \n, thx @mrjj

    Thank you. I am now using QByteArray::indexOf, and what used to take about 2 minutes now takes about 17 seconds. Thanks again!

