Strange behavior when reading text file line by line with QTextStream



  • I have this basic program:

    @ QFile file("h:/test.txt");
    file.open(QFile::ReadOnly | QFile::Text);
    QTextStream in(&file);

    bool found = false;
    uint pos = 0;
    
    do {
        QString temp = in.readLine();
        int p = temp.indexOf("something");
        if (p < 0) {
            pos += temp.length() + 1;
        } else {
            pos += p;
            found = true;
        }
    } while (!found && !in.atEnd());
    
    in.seek(0);
    QString text = in.read(pos);
    cout << text.toStdString() << endl;@
    

    The input file looks like this:

    bq. this is line one, the first line
    this is line two, it is second
    this is the third line
    and this is line 4
    line 5 goes here
    and finally, there is line number 6

    The idea, of course, is to find the first occurrence of a string and load the text file from the start up to that location. Searching for strings that are on the first 5 lines produces the expected output:

    with indexOf("first") output is:

    bq. this is line one, the

    with "cond":

    bq. this is line one, the first line
    this is line two, it is se

    with "here":

    bq. this is line one, the first line
    this is line two, it is second
    this is the third line
    and this is line 4
    line 5 goes

    However, if I pass "num" that is on the last line I get an unexpected result:

    bq. this is line one, the first line
    this is line two, it is second
    this is the third line
    and this is line 4
    line 5 goes here
    and finally, there is

    Five characters are missing from line 6. If the match were on line 7, six characters would be missing, and so on. Every line but the last behaves normally; the last line loses lineNumber - 1 characters.

    Maybe it's because it's 5 AM, but I've been staring at this for like 30 minutes and cannot figure out why... so humiliating...



  • This seems to be a bug...
    I tried putting in debug statements.
    Your code works if there is a newline character at the end of the last line.
    Strangely, if there is no newline at the end, the in.readAll() call returns
    "line number 6"

    If there is one, it returns
    "number 6"

    In both cases the pos value is the same.
    So, as a possible workaround, you could append a newline to the end of the file.

    Another strange observation from an experiment:
    I took the file without a newline at the end of the last line and called in.readAll().size(). It returned 163, which is correct.
    Then I added a newline at the end of the last line and did the same thing. It returned 159, which is very strange; it should have returned 165. Therefore it clearly is a bug. You should report it.



  • Any other ideas?



  • Running this on a Windows machine by any chance? Your logic allows for the length of each unmatched line plus one byte. On Windows the end-of-line marker is two bytes. So, for each unmatched line you read, your pos value is incremented by one byte less than it should be. When you slurp pos bytes at the end, you are slurping fewer bytes than you should be.
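    To make that arithmetic concrete, here is a minimal sketch in plain standard C++ (no Qt). It assumes every line ends in CR/LF and that the computed offset is later applied to the raw, unconverted bytes; whether QTextStream actually does that is not established here. The names `offsetPlusOne` and `rawOffset` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper mimicking the loop in the question: split on '\n',
// strip a trailing '\r' (as a text-mode reader would), and add
// length + 1 to the running position for every unmatched line.
std::size_t offsetPlusOne(const std::string &raw, const std::string &needle)
{
    assert(!needle.empty());
    std::size_t pos = 0;
    std::size_t start = 0;
    while (start < raw.size()) {
        std::size_t end = raw.find('\n', start);
        if (end == std::string::npos)
            end = raw.size();
        std::string line = raw.substr(start, end - start);
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        const std::size_t p = line.find(needle);
        if (p != std::string::npos)
            return pos + p;          // match: add the column and stop
        pos += line.size() + 1;      // unmatched: line length plus ONE for the EOL
        start = end + 1;
    }
    return pos;                      // no match: total of all counted lines
}

// The offset of the needle in the raw bytes, where CR/LF occupies two bytes.
// Returns std::string::npos when the needle is absent.
std::size_t rawOffset(const std::string &raw, const std::string &needle)
{
    return raw.find(needle);
}
```

    For the six-line sample file with CR/LF endings, `offsetPlusOne(raw, "num")` gives 150 while `rawOffset(raw, "num")` gives 155: a shortfall of one per preceding line, which matches the five missing characters. Note that the same model also predicts a shortfall for matches on earlier lines, which is not what the original poster observed, so it can only be part of the story.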

    I'd use a different approach that is not fussed by line endings. If the files are typically small then something like:
    @
    const QString lookingFor("blah");
    QFile file("h:/test.txt");
    QByteArray data; // declared outside the if so it is still in scope below
    if (file.open(QFile::ReadOnly)) { // no QFile::Text: line ending conversion not wanted
        data = file.readAll();
        // search for the encoded bytes: must allow for encoding differences
        const int pos = data.indexOf(lookingFor.toUtf8());
        if (pos >= 0)
            data.truncate(pos); // keep only the bytes before the match
    }
    QString result = QString::fromUtf8(data);
    @
    I assume the file is UTF8 encoded, you might need to adjust.



  • If the EOL is two bytes, then why do I get the expected result for all lines except the last? I should be losing one character for every line, but that is not the case. Compensating with two bytes for each line doesn't produce the expected behavior either.



  • There is no guarantee line endings are consistent within one file.

    I agree with ChrisW67 that you shouldn't write code that depends on a specific EOL convention.

    If the file is too big to read all into memory, I would do something like:

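    @const qint64 BUFFSIZE = 100*1024; // 100 KB
    const QByteArray lookingFor = QString("blah").toUtf8();
    QByteArray data;
    QFile file("h:/test.txt");
    qint64 pos = -1; // stays -1 if the string is not found
    if (file.open(QFile::ReadOnly))
    {
        while (!file.atEnd())
        {
            data.append(file.read(BUFFSIZE));
            const int index = data.indexOf(lookingFor);
            if (index >= 0)
            {
                // file.pos() is where data ends, so this is the absolute offset
                pos = file.pos() - data.length() + index;
                break;
            }
            // keep the tail so a match spanning two buffers is still found
            data = data.right(lookingFor.length() - 1);
        }
    }@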

    Note: We need some overlap to handle the case where the string is on the boundary between two buffers.
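    The overlap idea does not depend on Qt. As a rough sketch under that assumption, the same loop can be written against a standard `std::istream`; the names `findInChunks` and `findInString` are invented for this example, and the tiny chunk sizes are only there to make the boundary case easy to try.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the absolute byte offset of the first occurrence of needle in the
// stream, or -1 if it is not found. Reads chunkSize bytes at a time and keeps
// the last needle.size() - 1 bytes between reads as the overlap.
long long findInChunks(std::istream &in, const std::string &needle, std::size_t chunkSize)
{
    assert(!needle.empty() && chunkSize > 0);
    std::string window;                 // overlap tail followed by the newest chunk
    long long consumed = 0;             // total bytes read from the stream so far
    std::vector<char> chunk(chunkSize);
    for (;;) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunkSize));
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;                      // nothing more to read
        window.append(chunk.data(), static_cast<std::size_t>(got));
        consumed += static_cast<long long>(got);
        const std::size_t index = window.find(needle);
        if (index != std::string::npos) {
            // window ends at 'consumed', so it starts at consumed - window.size()
            return consumed - static_cast<long long>(window.size())
                   + static_cast<long long>(index);
        }
        const std::size_t keep = needle.size() - 1;
        if (window.size() > keep)
            window.erase(0, window.size() - keep);  // keep only the overlap tail
    }
    return -1;
}

// Convenience wrapper for trying the function on in-memory data.
long long findInString(const std::string &data, const std::string &needle, std::size_t chunkSize)
{
    std::istringstream in(data);
    return findInChunks(in, needle, chunkSize);
}
```

    With a chunk size of 4, `findInString("abcdefghij", "def", 4)` has to match across the boundary between "abcd" and "efgh"; without the retained tail the match would be missed.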



  • Yes, we all agree the solution is not ideal, but the thread is not about a better solution but about the strange behavior this one produces.

    What puzzles me is the inconsistency. If the problem is the EOL marker being 2 bytes, then I should be losing a character for each line. But no characters are lost except on the last line. That is what I fail to understand and would like to know.



  • I was going off a common off-by-one issue with text files on Windows.

    I cannot reproduce any issue with the first solution: whether the code is run on Windows, Linux, with either line ending, with or without a trailing EOL marker on the last line. I don't see prady_80's "bug" or your inconsistent behaviour.

    Edit: Damn... I wasn't seeing the obvious. I'll look into it



  • Changing the file's EOL from CR/LF to LF/CR fixes the problem. Can't provide more insight though :)



  • According to the size of my input the EOL character is 1 byte. The size of the file is exactly the number of characters plus number of new lines.
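    One way to check this directly, rather than inferring it from the file size, is to count the line-ending bytes in the raw file. The following is only a sketch in standard C++; `EolCounts`, `countLineEndings` and `readAllBytes` are hypothetical names, and the file must be opened in binary mode so that no conversion happens before counting.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

struct EolCounts {
    std::size_t crlf = 0;  // "\r\n" pairs
    std::size_t lf = 0;    // lone "\n"
    std::size_t cr = 0;    // lone "\r"
};

// Classifies every line ending in a raw byte buffer.
EolCounts countLineEndings(const std::string &bytes)
{
    EolCounts c;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\r') {
            if (i + 1 < bytes.size() && bytes[i + 1] == '\n') {
                ++c.crlf;
                ++i;               // skip the '\n' of the pair
            } else {
                ++c.cr;
            }
        } else if (bytes[i] == '\n') {
            ++c.lf;
        }
    }
    // sanity check: the counted EOL bytes cannot exceed the buffer size
    assert(c.crlf * 2 + c.lf + c.cr <= bytes.size());
    return c;
}

// Reads a whole file without any text-mode translation.
// Returns an empty string if the file cannot be opened.
std::string readAllBytes(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

    Calling `countLineEndings(readAllBytes("h:/test.txt"))` would show whether the file really uses one-byte endings throughout, or a mix.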

    Something weird is happening on the last line, and so far it has me completely puzzled. I even started a "bounty at SO":http://stackoverflow.com/questions/15850133/qtextstream-behavior-searching-for-a-string-not-as-expected; hopefully someone will shed light on this issue. Not that there aren't workarounds, but it piqued my curiosity.

