Parsing large/big text files (quickly)

9 Posts 5 Posters 3.6k Views
    m0ng00se
    wrote on 7 Jan 2019, 17:34 last edited by m0ng00se 1 Jul 2019, 17:40
    #1

    Hi all,

    I have a large text file to parse (around 300 MB, about 8.7 million lines). Every line has the following format:

    "1    689 0 0 02WA Cawleys South"
    

    I need to extract tokens 0, 1, 4 and 5 (where token 5 must be the rest of the line). In other words:

    index 0 = 1
    index 1 = 689
    index 4 = 02WA
    index 5 = Cawleys South

    Note that the last one could be two words, three words, or even five words; I don't know in advance.

    At the moment, I read each line in a while loop, simplify it to remove the excess spaces, and then split it to get the first three values I need.

        if (file.open(QIODevice::ReadOnly | QIODevice::Text))
        {
            QTextStream in(&file);
            QStringList tokens;
            QString line;
            while(in.readLineInto(&line)){
                    // simplified() returns a new string; it does not modify line in place
                    line = line.simplified();
                    tokens = line.split(QRegExp("\\s+"));
                    tokens[0] // this will be 1
                    tokens[1] // this will be 689
                    tokens[4] // this will be 02WA
                    ...
    

    Then I use the section() function to get that last "bit" I need:

                   line.section(" ", 5); //this will be Cawleys South
    

    I then write the data I extracted into another file in csv format.

    However, it takes ages: around 2-3 minutes to read through the file. With Java or Python, it takes seconds.

    What would be the best way to speed up this process?

    Thanks all.

      Christian Ehrlicher
      Lifetime Qt Champion
      wrote on 7 Jan 2019, 17:50 last edited by Christian Ehrlicher 1 Jul 2019, 18:31
      #2

      I would not convert the data to a QString at all. The RegExp (especially creating it inside the loop) is not fast, and neither are the QString creations; take a look at splitRef().
      But it works without any of that:

      while (file.canReadLine()) {
        const QByteArray line = file.readLine().simplified();
        const QList<QByteArray> tokens = line.split(' ');
        ...
      }
      

      or maybe even iterate over the line with QByteArray::indexOf(' ', oldIdx)

      // edit: fixed the split to split by space instead of \n, thx @mrjj
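The indexOf idea can be sketched without Qt. The block below is a minimal illustration in standard C++17, where std::string_view::find plays the role that QByteArray::indexOf(' ', oldIdx) would play in Qt. The Record struct, nextToken and parseLine are hypothetical names made up for this sketch, not part of the thread or of Qt. It assumes fields are separated by plain spaces only (no tabs) and that the name field is kept exactly as it appears in the line.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical record type for this sketch; the field names are assumptions.
struct Record {
    std::string_view id;    // token 0, e.g. "1"
    std::string_view code;  // token 1, e.g. "689"
    std::string_view tag;   // token 4, e.g. "02WA"
    std::string_view name;  // token 5 up to the end of the line
};

// Returns the next space-delimited token at or after pos and moves pos to the
// character just past it. An empty view means the line is exhausted.
inline std::string_view nextToken(std::string_view line, std::size_t &pos)
{
    pos = line.find_first_not_of(' ', pos);            // skip a run of spaces
    if (pos == std::string_view::npos) {
        pos = line.size();
        return {};
    }
    std::size_t end = line.find(' ', pos);             // like indexOf(' ', oldIdx)
    if (end == std::string_view::npos)
        end = line.size();
    const std::string_view token = line.substr(pos, end - pos);
    pos = end;
    return token;
}

// Extracts tokens 0, 1, 4 and "everything from token 5 on" without copying.
// Returns std::nullopt when the line has fewer than six tokens.
inline std::optional<Record> parseLine(std::string_view line)
{
    // Drop the trailing newline, carriage return and spaces.
    const std::size_t last = line.find_last_not_of(" \r\n");
    if (last == std::string_view::npos)
        return std::nullopt;
    line = line.substr(0, last + 1);

    std::size_t pos = 0;
    std::string_view tokens[5];
    for (std::string_view &t : tokens) {
        t = nextToken(line, pos);
        if (t.empty())
            return std::nullopt;
    }

    // Whatever follows the fifth token is the (possibly multi-word) name.
    const std::size_t nameStart = line.find_first_not_of(' ', pos);
    if (nameStart == std::string_view::npos)
        return std::nullopt;
    return Record{tokens[0], tokens[1], tokens[4], line.substr(nameStart)};
}
```

Because the fields are views into the input line, nothing is allocated per token; the views are only valid while the line buffer is alive, so they should be written out (for example to the CSV file) before the next line is read into the same buffer.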

        m0ng00se
        wrote on 7 Jan 2019, 18:04 last edited by
        #3

        Thank you, I will test that and report back.

          mrjj
          Lifetime Qt Champion
          wrote on 7 Jan 2019, 18:06 last edited by
          #4

          Hi
          Am I reading it wrong, or should
          line.split('\n');
          not be
          line.split(' ');
          ?

            fcarney
            wrote on 7 Jan 2019, 18:10 last edited by fcarney 1 Jul 2019, 18:11
            #5

            @mrjj said in Parsing large/big text files (quickly):

            Hi
            Am I reading it wrong, or should
            line.split('\n');
            not be
            line.split(' ');
            ?

            It should be a space. A regex makes no sense as the split separator here: simplified() already returns the line with each run of whitespace replaced by a single space. The only change I can think of is if Qt defines a dedicated "space" character that should be used instead of ' '. What does simplified() use internally?
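On the last question: as far as the Qt documentation describes it, QByteArray::simplified() treats as whitespace any character for which isspace() is true in the C locale, that is '\t', '\n', '\v', '\f', '\r' and ' '. The sketch below imitates that behaviour in standard C++ to show why splitting the result on a single ' ' is enough. isAsciiSpace and simplify are hypothetical helper names for this illustration; this is not Qt's actual implementation.

```cpp
#include <string>
#include <string_view>

// True for the six ASCII whitespace characters listed above.
inline bool isAsciiSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r';
}

// Hypothetical stand-in for simplified(): trims both ends and replaces every
// internal run of whitespace with exactly one ' '.
inline std::string simplify(std::string_view in)
{
    std::string out;
    out.reserve(in.size());
    bool pendingSpace = false;
    for (char c : in) {
        if (isAsciiSpace(c)) {
            // Remember the gap, but only once something has been emitted,
            // so leading whitespace is dropped.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace)
                out += ' ';
            pendingSpace = false;
            out += c;
        }
    }
    // A pending space at the end is never written, so trailing whitespace is dropped.
    return out;
}
```

After this normalisation, every separator is a single ' ', so a plain character split yields the same tokens a whitespace regex would.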

              m0ng00se
              wrote on 7 Jan 2019, 18:30 last edited by
              #6

              Very interesting. I ran a quick test with just the following so far.

              This takes 17 seconds:

              line.simplified();
              

              This takes 14 seconds:

              const QByteArray line = file.readLine().simplified();
              

              This takes just over 2 minutes (130 seconds):

              line.simplified();
              tokens = line.split(QRegExp("\\s+"));
              

              But this takes only 20 seconds, compared with 130 seconds for the QString equivalent:

              const QByteArray line = file.readLine().simplified();
              const QList<QByteArray> tokens = line.split(' ');
              

              I think I am on the right track, thanks!

                Christian Ehrlicher
                Lifetime Qt Champion
                wrote on 7 Jan 2019, 18:32 last edited by
                #7

                @m0ng00se said in Parsing large/big text files (quickly):

                tokens = line.split(QRegExp("\\s+"));

                This is what I would expect and what I said above: you are constructing a regexp object for every line. Moving it out of the loop will speed things up too. But, as already said, using a regexp here is pointless.
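For cases where a regular expression really is wanted, the hoisting advice looks roughly like this. The sketch uses std::regex from the standard library as a stand-in for QRegExp or QRegularExpression, and splitLines is a hypothetical function name. The only point it illustrates is that the pattern is compiled once, before the loop, rather than once per line.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Splits every line on runs of whitespace, reusing one compiled pattern.
inline std::vector<std::vector<std::string>> splitLines(const std::vector<std::string> &lines)
{
    const std::regex whitespace(R"(\s+)");   // compiled once, outside the loop

    std::vector<std::vector<std::string>> result;
    result.reserve(lines.size());
    for (const std::string &line : lines) {
        std::vector<std::string> tokens;
        // Submatch index -1 selects the text between matches, i.e. the tokens.
        for (std::sregex_token_iterator it(line.begin(), line.end(), whitespace, -1), end;
             it != end; ++it) {
            if (it->length() > 0)            // skip the empty piece before leading whitespace
                tokens.push_back(it->str());
        }
        result.push_back(std::move(tokens));
    }
    return result;
}
```

Even with the pattern hoisted, a regex split does more work per line than a plain character split, which is consistent with the advice in this thread to drop the regex altogether.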

                  SGaist
                  Lifetime Qt Champion
                  wrote on 7 Jan 2019, 20:42 last edited by
                  #8

                  Hi,

                  As an additional note, if you are that keen on using regular expressions, use QRegularExpression; QRegExp is deprecated.

                    m0ng00se
                    wrote on 7 Jan 2019, 22:21 last edited by
                    #9

                    @Christian-Ehrlicher said in Parsing large/big text files (quickly):

                    or maybe even iterate over line with QByteArray::indexOf(' ', oldIdx)

                    // edit: fixed the split to split by space instead of \n, thx @mrjj

                    Thank you. I am now using QByteArray::indexOf, and what used to take about 2 minutes now takes about 17 seconds. Thanks again!
