Next & Previous line in a text file at N bytes position (N = middle)



  • Hello, i need some help getting to the next and previous line i a file. Basically i have a .txt file with about 735354 bytes (~718KB)
    After opening the file, i want to jump to the middle of the file (correct me if i'm wrong):

    @int min = 0;
    int max = 735354;
    int mid = (min + max) / 2; // 367677 bytes
    file.seek(mid); // seek to 367677 bytes position in the file which has 735354 bytes // file is of type QFile@

    Because i don't know where i am exactly in the file, i want to search for the first occurence of the newline char "\n" to get to the beginning of the line (otherwise it is possible that seek() brought me somewhere in the middle of a text line).

    @// this is enough chars for my case - i'm sure no line will be bigger in my text file
    char lineBuffer[1024];

    // Among other circumstances, "readLine":http://doc.qt.nokia.com/latest/qiodevice.html#readLine reads data until The first '\n' character is read.
    qint64 lineLength=file.readLine(lineBuffer,sizeof(lineBuffer));

    @

    I hope that i'm now positioned almost exactly in the middle of a text file at the beginning of a newline. yes?

    I need some help how to implement functions for the next (i+1) and previous (i-1) lines ?
    I don't know my line number after readLine()..i'm the whole time positioned based on bytes.
    I need this so when i do parsing at some point i could compare some values from the current line vs previous and next line.

    Thanks, vFox



  • Be aware that your heuristic to "calculate" the middle of the file linewise only works if your lines are roughly equally long. The more the line lengths differ, the more it is likely that you do not "land" in the middle of the file:

    @
    This is a the very first line with some some stuff you will have to read and understand.
    Did you understand?
    Yes.
    Really?
    Yes?
    @

    Your calculation will lead to the very first line as the "middle" of the file.

    To be more precisely you might want to scan the file using readLine(), and count the lines.

    Depending on what you want to do with the data, you could also hold the contents in a [[Doc:QStringList]], which would ease the navigation. ~720k of data is not too much to be held in memory at once.



  • Since you need to compare random lines (at least N is a random value to me) I suggest you read the file and store each line into a QList<QString>. After that you compute, depending on the size of each line (one at time) which line corresponds to the middle. The you have the line number and can jump forward and backward. Drawback of this approach is when a file is really big, or huge, so that you need to load all the lines in memory. Optimization can of course be done in such case.



  • [quote author="Volker" date="1317248814"]Be aware that your heuristic to "calculate" the middle of the file linewise only works if your lines are roughly equally long. The more the line lengths differ, the more it is likely that you do not "land" in the middle of the file:

    Your calculation will lead to the very first line as the "middle" of the file.
    To be more precisely you might want to scan the file using readLine(), and count the lines.
    Depending on what you want to do with the data, you could also hold the contents in a [[Doc:QStringList]], which would ease the navigation. ~720k of data is not too much to be held in memory at once.[/quote]

    Hi Volker,

    Yes, text lines in my file indeed have little deviation of a certain length, i don't have the "yes, really, yes case" in your example. But as i mentioned in the first post, for safety reasons i do:
    qint64 lineLength=file.readLine(lineBuffer,sizeof(lineBuffer));

    Ok, this is just an example size for this post, the real reason behind this post is i'm doing a modified binary search on a file that can have sometimes 200mb.

    I have just one core problem - getting the next line and previous line from a certain byte position (knowing that i'm on the beginning of a line) - so that i can extract info from those lines and compare.

    example: if ( dtPrevLine() <= dtKey <= dtNextLine() )....

    My mission is to avoid scanning the whole file. Basically i would have to pick enough bytes for any text line in my file. Remember my current position, for the nextLine() function i would need to read chars until \n, remember how much bytes i have read, and than seek to that position from the previous position...something like that?

    Anyway..i don't have line numbers, and file size grows everyday...



  • Since you are moving on an average line size, you can also suppose your line number is seekBytes / averageLineNumber (of course a rounded int value). Now I suggest you read at least a few lines to get a good average line length before performing the calculation.



  • fluca: Sounds a little guessing and statisticall to me - i need something more solid.

    I think i found a solution for nextLine, i still need to figure out prevLine...
    Here it is a shortened version for calculating next line in my project:

    @QFile file(strFileName);
    qint64 nFileSize = file.size();
    qint64 lineLength;
    char lineBuff[1024];
    QDatetime dtMid;

    // because we do seeking often - temp save the positions
    int nPrev_Pos = 0;
    int nCurr_Pos = 0;
    int nNext_Pos = 0;

    int nMin = 0;
    int nMax = nFileSize;
    int nMid = (nMinPos + nMaxPos) / 2;

    file.open(QIODevice::ReadOnly);
    file.seek(nMid ); // seek to nMid bytes
    lineLength=file.readLine(lineBuff,sizeof(lineBuff));

    // after readLine found a "\n" we are positioned at the beginning of a new line
    // (ps: lineBuff is enough)

    // save the current position
    nCurr_Pos = file.pos();

    // read 23 chars from beginning into lineBuff
    // (this is date on every beginning of a line inside the file -> that's what i'm comparing)
    lineLength = file.readLine(lineBuff, 24);

    // convert the date string from the file into QDateTime object
    dtMid = QDateTime::fromString(QString(lineBuff),"yyyy-MM-dd HH.mm.ss.zzz");

    // Do this also for the next line!!
    QDateTime dtNext = nextLine(file,nCurr_Pos);

    // compare QDateTime objects
    if( dtMid < dtNext )
    {
    QString ii = dtNext.toString(); // break point - check value - yes, yes..
    }

    ...

    // nextLine
    QDateTime nextLine(QFile &file, int nPos)
    {
    char lineBuff[1024];
    qint64 lineLength;
    QDateTime dtVal;

    lineLength=file.readLine(lineBuff,sizeof(lineBuff)); // read the *next* line
    
     lineLength=file.readLine(lineBuff, 24); // extract datetime
     dtVal = QDateTime::fromString(QString(lineBuff),"yyyy-MM-dd HH.mm.ss.zzz");  
    
     if(dtVal.isValid())
         file.seek(nPos);  // revert back to the original position where i was before calling nextLine
    

    return dtVal;
    }@

    Now, how can i implement readling to read backwards for prevLine() function??


  • Moderators

    Would it be sufficient to find a position in your file, and instead of seeking to that position, point X, in the file, seek N-bytes before that position and read a chunk of data (say N*2 bytes) into a buffer? Then you could parse the buffered data, splitting it into individual lines (perhaps in a QStringList). By knowing that the Nth byte in the buffer lies on the line you seeked to, you could easily determine the line before and the line after.

    As an aside, bear in mind you probably also need to handle the case where your file might have exactly one line of text in it.

    Could you give a little more concrete idea of what you're trying to accomplish with your file in general? What is the task that you're trying to accomplish? Perhaps there is a better algorithm in general, rather than trying to do the seek/read/look-for-line black magick that is being talked about here.

    Edit: changed for content and clarity.



  • You can start reading on an arbitrary position in your file, but to find the beginning or end of the line, you will have to scan every character.

    For the rest, I just have to endorse to mlong. It would be more helpful if we knew what you want to achieve.



  • I have a file (.log) that is filled with millions of textual lines, each line begins with a date and time string, an example of one line:

    2011-06-06 06.36.16.788 Information BUSINESS ClientIP=...|ServerIp=... etc

    Because my file is sorted by dates - i can use a binary search algorithm. Because the user has the option to choose from and to dates at file->open dialog, i can speed the parsing (i'm showing data in a TableView gui) - note: parsing 150mb file takes a minute or two. Binary search would search the From and To dates and return the file positions of them, so i can skip alot when parsing the file.

    Now the problem is this:

    1. I don't have an array but a file (disk file)
      2) I don't have an full-sorted structure but... semi-sorted structure, see why? :

    2011-06-06 06.36.16.780
    2011-06-06 06.36.16.785
    2011-06-06 06.36.16.788

    The above dates are sorted but if a user ask for:
    2011-06-06 06.36.16.782
    Then there is NO Match - but the classic Binary Search always look for the exact match.

    I needed to modify binary search. I can't count all the lines (then my bin search doesn't make sense) - i am byte-positioned all the time. And because i often won't find the exact match, i need to consider finding the closest neighbour of the user selected key (date).

    I header file i have the following functions:

    @int Search(QFile &file, QDateTime dtKey,int nMinPos, int nMaxPos, bool bLowerBound);
    QDateTime midDate(QFile &file, int nPos);
    QDateTime nextDate(QFile &file, int nPos);
    QDateTime prevDate(QFile &file, int nPos);
    void prevLine(QFile &file);
    bool nextLine(QFile &file);@

    • Search(..) - returns the actual (new) position of the From & To dates. This function is called twice (one for FromDate & one for ToDate) - after that i call my main parsing function telling to parse only between these to positions (dates)

    • midDate(..) - calculated the middle date - every time i need to seek to the middle i use this functions for comparing this QDateTime object with the actual key

    etc..

    • prevLine, nextLine... - When i eventually find my closest (or exact) date object, these functions actually reposition my file pointer and return the position (int) - those are my needed results.

    I have already implemented most of it, and i'm now at testing. Will report back.

    THX



  • Sounds like an SQL database solution would be much more appropriate for your problem.

    If I was bound to the log files, I would take that in regular intervals and copy the contents into a database. The search in a DB is much more efficient than navigation such a huge file.


Log in to reply
 

Looks like your connection to Qt Forum was lost, please wait while we try to reconnect.