
Fastest way to read part of 300 Gigabyte binary file



  • Hi,
    I have a binary file about 300 Gigabytes in size. To get general information about it I need to read every n-th byte — in my case every 8000th byte, as a 4-byte integer. I wrote the code below to try this and it has now been running for about 2 hours.
    As far as I know it is slow because each fread call is expensive, while fseek is fairly fast. So I thought that if I could call fread only once and pass it a vector of offsets, I might improve the performance. Maybe Qt has something like that? Or what should I try?
    By the way, the 300 Gigabyte data file is on an NTFS file system on an external drive. I use Windows 10 x64, MSVC x64.

    #include <iostream>
    #include <stdio.h>
    #include <QtEndian>
    #include <QVector>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"G:/DATA/CDP_FOR_REGLO.sgy"};
        FILE *pFile;
        qint64 N = 300000000000LL/8000; // one 4-byte value every 8000 bytes
        QVector<quint32_le> FFID(N);    // FFID is a vector of size N, one number takes 4 bytes
    
        pFile = fopen(segyFile, "rb");
        if (pFile == nullptr){
          std::cout << "Error opening segy-file!" << std::endl;
          return 1;
        }
    
        // read every 8000-th byte in a loop
        wall_clock timer;
        timer.tic();
        long int offset = 7996;
        for(qint64 i = 0; i < N; i++){
            if (fread(&FFID[i], 4, 1, pFile) != 1)
                break;                          // stop on short read / EOF
            _fseeki64(pFile, offset, SEEK_CUR); // 64-bit seek from current position: the file is far larger than 2 GB
        }
        double n0 = timer.toc();
        std::cout << n0 << std::endl;
        fclose(pFile);
    }
    

  • Lifetime Qt Champion

    Hi,

    I haven't used it but it looks like you could benefit from the map function.

    Note that depending on what external support your file is on, it could also be a bottleneck.

    Hope it helps





  • @SGaist I will try it today, thank you! I will report here whether map is faster.
    But if this is based on the memory-mapping technique, it has some restrictions I'm trying to avoid. For example, memory mapping only lets you map files located on your own computer. If, say, two computers are connected over a local network and the file is on the second computer, you can't map it from the first one.
    I ran into that problem when I chained two computers into a "cluster": using Matlab I tried to use memory mapping and got an error.



  • @Please_Help_me_D
    Assuming you are talking about mmap() et al.: no, it's likely not to work on a remote file!

    At the risk of being shot down: you can't really do much better or faster than "seek-and-read". With the values 8,000 bytes apart, reading everything instead of seeking won't help. You might try the unbuffered level, read()/lseek(), instead of fread()/fseek(); it's worth a try, since you don't need the buffered reading that comes with the latter.

    Reading from a 300GB file across a network is indeed going to take some time; 2 hours may not be long! The only way to really speed this up for a network file is to run the code on the server which has the file system local, and have a client request just the values it needs sent to it remotely.
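    A minimal sketch of that unbuffered idea in portable C++ (the helper name and file layout here are invented for illustration; on Windows you could substitute `_open()`/`_read()` and `_fseeki64()`/`_lseeki64()` for files larger than 2 GB):

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

// Read `count` int32 values spaced `stride` bytes apart, starting at byte
// `first`, with stdio buffering disabled so each fread issues exactly one
// small read instead of filling a large internal buffer first.
std::vector<int32_t> readEveryNth(const char *path, long first, long stride, std::size_t count)
{
    std::vector<int32_t> out;
    FILE *f = std::fopen(path, "rb");
    if (!f) return out;
    std::setvbuf(f, nullptr, _IONBF, 0);   // unbuffered: no read-ahead copy
    if (std::fseek(f, first, SEEK_SET) != 0) { std::fclose(f); return out; }
    for (std::size_t i = 0; i < count; ++i) {
        int32_t v;
        if (std::fread(&v, sizeof v, 1, f) != 1) break;   // stop on EOF / error
        out.push_back(v);
        // skip ahead to the next value (stride counts from value start to value start)
        if (std::fseek(f, stride - (long)sizeof v, SEEK_CUR) != 0) break;
    }
    std::fclose(f);
    return out;
}
```

    Whether dropping the stdio buffer actually wins anything depends on the platform; it merely removes the library's buffer fill between the seek and the 4 bytes you keep.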



  • Read 8000 bytes per read without any seeking, and use only the first 4 bytes of each buffer.

    sorry for my english
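    That chunked idea as a small sketch (plain C++; the helper name and the tiny record size used in the comment are invented for illustration):

```cpp
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <vector>

// Instead of fseek-ing between 4-byte reads, read whole `chunkSize`-byte
// records sequentially and keep only the first 4 bytes of each buffer.
// Sequential reads let the OS read-ahead work in our favour.
std::vector<int32_t> readFirstIntOfEachChunk(const char *path, std::size_t chunkSize, std::size_t count)
{
    std::vector<int32_t> out;
    std::vector<unsigned char> buf(chunkSize);
    FILE *f = std::fopen(path, "rb");
    if (!f) return out;
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fread(buf.data(), 1, chunkSize, f) != chunkSize) break;
        int32_t v;
        std::memcpy(&v, buf.data(), sizeof v);  // use only the first 4 bytes
        out.push_back(v);
    }
    std::fclose(f);
    return out;
}
```

    For the OP's case `chunkSize` would be 8000; whether reading the 7996 unwanted bytes beats seeking over them is exactly what is worth measuring.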



  • @SGaist It took 15155 seconds (4 hours 12 min) to read the data.
    @JonB I'm going to try read()/lseek()

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"G:/DATA/CDP_FOR_REGLO.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
            std::cout << "Error opening segy-file!" << std::endl;
            return 1;
        }
        uchar *memory = file.map(0, file.size()); // map the whole file read-only
        if (memory) {
            std::cout << "started..." << std::endl;
            wall_clock timer;
            qint64 N = 43933814;   // number of bytes to pick
            qint64 Nb = 8000;      // byte stride between them
            QVector<uchar> FFID(N);
            timer.tic();
            for(qint64 i = 0; i < N; i++){
                FFID[i] = memory[i*Nb]; // one byte every Nb bytes
            }
            double n0 = timer.toc();
            std::cout << n0 << std::endl;
            std::cout << "finished!" << std::endl;
        }
    }
    

  • Lifetime Qt Champion

    Did you consider mapping only the parts that are pertinent to what you want to read ?



  • @SGaist If that is possible, I would try it. Could you please give me some hints on how to do that?
    Also, do you know if it is possible to define an array (or vector) of indexes that I want to read, and instead of a loop just write something like FFID[ind0] = memory[ind1];, where ind0 is an array (vector) = {0, 1, 2, 3, ...} and ind1 is an array (vector) = {0, 8000, 16000, 24000, ...}?


  • Lifetime Qt Champion

    Well, the first parameter is an offset and the second is a size so you could jump from point to point.



  • @SGaist but as far as I know the offset and the size are single values. If I need the 10th, 20th and 30th elements then I need multiple offsets, because the offset is a number of bytes from the beginning of the file. Or am I misunderstanding something?



  • @SGaist @JonB I've tried a few ways to read a 115 Megabyte file in the way described above (reading every n-th byte). The results:
    fread/fseek = 0.28 seconds
    QFile::map = 0.06 seconds
    std::ifstream/seekg = 0.35 seconds
    _read/_lseek = 0.29 seconds

    So the fastest is the memory-mapping technique, and it seems I'm going to use it. Since I don't fully understand how to optimize the code with QFile::map, could you please explain how to change it? My data consists of qint16, qint32 and float values: something like 10 bytes of qint16 first, then 12 bytes of qint32, then 1000 bytes of single-precision floats, and this triplet (10 bytes -> 12 bytes -> 1000 bytes) repeats until the end of the file. Is it possible to map the whole file in this mixed format?
    If not, how do I map it as qint32 rather than uchar? In my example below I could only map it as uchar.
    I use Armadillo only for the timings here.
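    The 10/12/1000-byte record layout described above can be parsed straight out of a mapped byte buffer without any pointer casts, e.g. (a sketch; the constants come from the description above, not from the SEG-Y specification, and the names are invented):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One record as described: 10 bytes of int16 values, then 12 bytes of
// int32 values, then 1000 bytes of floats -> 1022 bytes per record.
constexpr std::size_t kInt16Part  = 10;
constexpr std::size_t kInt32Part  = 12;
constexpr std::size_t kFloatPart  = 1000;
constexpr std::size_t kRecordSize = kInt16Part + kInt32Part + kFloatPart;

// Pull the first (little-endian, matching the host here) int32 out of every
// record in a mapped byte buffer. memcpy avoids any alignment assumption
// about `base`, so it works for arbitrary map offsets.
std::vector<int32_t> firstInt32PerRecord(const unsigned char *base, std::size_t nRecords)
{
    std::vector<int32_t> out(nRecords);
    for (std::size_t i = 0; i < nRecords; ++i)
        std::memcpy(&out[i], base + i * kRecordSize + kInt16Part, sizeof(int32_t));
    return out;
}
```

    The same pattern (base + record offset + field offset, then memcpy into the right type) extracts the qint16 and float fields too, so one mapping can serve the whole mixed format.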

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"C:/Users/tasik/Documents/Qt_Projects/raw_le.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
            std::cout << "Error opening segy-file!" << std::endl;
            return 1;
        }
        uchar *memory = file.map(3608, file.size()-3608);
        if (memory) {
            std::cout << "started..." << std::endl;
            wall_clock timer;
            qint64 N = 44861;    // number of 4-byte values
            qint64 Nb = 2640;    // byte stride between values
            QVector<uchar> FFID(N*4);
            timer.tic();
            for(qint64 i = 0; i < N; i++){
                // copy the 4 bytes of the i-th value; note the 4*i destination
                // index, otherwise successive iterations overwrite each other
                FFID[4*i]   = memory[i*Nb];
                FFID[4*i+1] = memory[i*Nb+1];
                FFID[4*i+2] = memory[i*Nb+2];
                FFID[4*i+3] = memory[i*Nb+3];
            }
            double n0 = timer.toc();
            std::cout << n0 << std::endl;
            std::cout << "finished!" << std::endl;
        }
    }
    


  • I forgot to mention that in the previous result the file C:/Users/tasik/Documents/Qt_Projects/raw_le.sgy was on the SSD. But when I put the file on the internal (local) HDD there was no difference in timing.



  • @Please_Help_me_D
    Yes, that would figure then! Meanwhile, I thought earlier on you were saying the file was on the network, that's a very different situation from a local SSD....



  • @JonB maybe I am confusing things :)
    My laptop has two storage devices: an SSD and an HDD. Windows is installed on the SSD. But neither of the two has enough free space to store the 300 Gigabyte file, so when I work with that file I use an external HDD (a third device) :)
    I then had the idea to check the read speed for the small data (115 Megabytes) after copying it to the external HDD as G:/raw_le.sgy. Here is the result:
    fread/fseek = 0.5 seconds
    QFile::map = 0.06 seconds (the only one that didn't change)
    std::ifstream/seekg = 0.6 seconds
    _read/_lseek = 0.4 seconds

    I have to note that when the external HDD is plugged in the timings are less stable; my laptop works a little harder from time to time...
    The interesting thing is that the external HDD increases the time of all the methods except memory mapping. Of course, I only read 0.18 Megabytes of the 115 Megabyte file, so the fact that the external HDD is attached via USB has a negligible effect on these timings, and it doesn't seem to matter whether such small data is on an internal or an external device. But with the big 300 Gigabyte file I suppose it plays the dominant role. I can't check that now because I don't have enough space on the laptop, but I'm going to try with 13 or 27 Gigabytes of data right now :D that should be interesting, I need to prepare the space))



  • @Please_Help_me_D
    Especially with memory mapping, I would think caching could easily affect your test timings. You'd better time only from a clean OS boot!

    I would also guess that memory mapping might suffer from the size of the file, as caching may be a factor. Testing with a 100MB file (which can easily be cached in memory) may not be representative of performance when the real file is 300GB.



  • @JonB said in Fastest way to read part of 300 Gigabyte binary file:

    I would also guess that memory mapping might suffer from size of file, as caching may be a factor. Testing it with a 100MB file (which can be easily memory cached) may not be representative of performance when the real file will be 300GB.

    Yes, that may be true... I need to test it.
    I will try to stop most programs (the anti-virus first of all) before launching my app.



  • @JonB I got the result. My file is 13.957 Gigabytes (about 14 Gigabytes). I read 1734480 int values, which is 6.9 Megabytes. The result:
    SSD internal

    • fread/fseek 213 Seconds
    • QFile::map 86 Seconds

    HDD internal

    • fread/fseek 350 Seconds
    • QFile::map 216 Seconds

    HDD external

    • fread/fseek 1058 Seconds
    • QFile::map 655 Seconds

    So the fastest way is to use memory mapping. And the most crucial factor when working with big data is whether I use an external HDD or an internal SSD/HDD.
    But I still need to optimize my QFile::map code from a few messages above. Does anybody know how to do that?

    For fread/fseek I used the code:

    #include <iostream>
    #include <stdio.h>
    #include <QtEndian>
    #include <QVector>
    #include <boost/endian/buffers.hpp>
    #include <boost/static_assert.hpp>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    using namespace boost::endian;
    
    
    int main()
    {
        char segyFile[]{"G:/DATA/STACK1_PRESTM.sgy"};
        FILE *pFile;
        // from byte 3600 onward the segyFile can be seen as a matrix with
        // nRow rows and nCol columns of 4-byte values
        unsigned short int nRow = 2060;
        unsigned long int nCol = 1734480;
        QVector<quint32_le> FFID(nCol);
    
        pFile = fopen(segyFile, "rb");
        if (pFile == nullptr){
          std::cout << "Error opening segy-file!" << std::endl;
          return 1;
        }
    
        // read a 4-byte value every (nRow-1)*4 bytes starting at byte 3608,
        // in other words read only the 3rd row
        wall_clock timer;
        timer.tic();
        _fseeki64(pFile, 3608, SEEK_SET);   // 64-bit seeks: the file is ~14 GB
        long int offset = (nRow-1)*4;
        for(unsigned long int i = 0; i < nCol; i++){
            if (fread(&FFID[i], 4, 1, pFile) != 1)
                break;
            _fseeki64(pFile, offset, SEEK_CUR);
            //std::cout << FFID[i] << std::endl;
        }
        double n0 = timer.toc();
        std::cout << n0 << std::endl;
        fclose(pFile);
    }
    

    And for QFile::map I used:

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"G:/DATA/STACK1_PRESTM.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
            std::cout << "Error opening segy-file!" << std::endl;
            return 1;
        }
        uchar *memory = file.map(3608, file.size()-3608);
        if (memory) {
            std::cout << "started..." << std::endl;
            wall_clock timer;
            qint64 N = 1734480;   // number of 4-byte values
            qint64 Nb = 2059*4;   // byte stride between values
            QVector<uchar> FFID(N*4);
            timer.tic();
            for(qint64 i = 0; i < N; i++){
                // copy the 4 bytes of the i-th value; note the 4*i destination
                // index, otherwise successive iterations overwrite each other
                FFID[4*i]   = memory[i*Nb];
                FFID[4*i+1] = memory[i*Nb+1];
                FFID[4*i+2] = memory[i*Nb+2];
                FFID[4*i+3] = memory[i*Nb+3];
            }
            double n0 = timer.toc();
            std::cout << n0 << std::endl;
            std::cout << "finished!" << std::endl;
        }
    }
    

  • Moderators

    @Please_Help_me_D
    out of curiosity, do you build and run your tests in release mode?

    Compiler optimizations could go a long way in improving the speed, if you so far only ran debug builds.



  • @J-Hilk
    Out of interest: I hope you are right, but I don't see how code which spends its time seeking and reading a few bytes out of an enormous file will benefit much from any code optimization. Presumably all the time is being taken in the OS calls themselves....


  • Moderators

    @JonB said in Fastest way to read part of 300 Gigabyte binary file:

    Presumably all the time is being taken in the OS calls themselves....

    You mean most time is lost during the network access calls? Possibly. But I would expect at least a couple of seconds of improvement anyway :)



  • @J-Hilk
    I would not, can't see how it would save anything here. But that aside, the OP wrote earlier:

    @SGaist 15155 seconds (4 hours 12 min) it took to read these data.

    Your "couple of seconds" is not going to be ground-breaking on that timing, is it? ;-)

    OK, the OP has shown a newer, quicker timing. By all means try release optimization, worth a go :)



  • @J-Hilk Yes I did all the experiments in release mode



  • @Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:

    uchar *memory = file.map(3608, file.size()-3608);

    is it possible to represent *memory as a heap of type qint32 rather than uchar?


  • Lifetime Qt Champion

    @Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:

    is it possible to represent *memory as a heap of type qint32 rather than uchar?

    Sure, cast the pointer to qint32*



  • @jsulm
    Your answer is in principle correct. However, should we warn the OP? I'm thinking this will only "work" if the return result of the QFile::map() he calls (given his offsets) is suitably aligned on a 32-bit boundary for a qint32 * to address without a segmentation fault. I don't see the Qt docs mentioning whether this is the case for the normally-uchar * return result.


  • Lifetime Qt Champion

    @JonB Could be, I'm not sure


  • Moderators

    @JonB
    well if you take a look at the loop so far:

    for(qint64 i = 0; i < N; i++){
                FFID[i] = memory[i*Nb];
                FFID[i+1] = memory[i*Nb+1];
                FFID[i+2] = memory[i*Nb+2];
                FFID[i+3] = memory[i*Nb+3];
            }
    

    no checks inside the loop nor before, so it's going to hard crash any way, when the file is not int32_t aligned.



  • @J-Hilk
    Umm, no, I don't see that. His current uchar *memory means he is only picking up bytes from there, and he made his FFID a QVector<uchar>. So he is copying one byte at a time (which is what I think he wants to get rid of), and the current code won't have any odd-boundary memory-alignment issue. But new code with qint32* instead of uchar* could have a problem....

    If his offset is always divisible by 4, like the example's 7996, then I would guess the return result of QFile::map() will not show any problem. This is an issue which does not arise when reading numbers from a file, only when mapping, so just be aware.
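    One way to sidestep the alignment question entirely is to memcpy each value out of the uchar* buffer instead of casting the pointer. A sketch (plain C++; `loadInt32` is an invented name):

```cpp
#include <cstdint>
#include <cstring>

// Alignment-safe alternative to reinterpret_cast<qint32*>: copy 4 bytes out
// of the byte buffer. This is well-defined C++ even when `p` is not 4-byte
// aligned, and on x86 compilers typically turn the memcpy into a single load.
int32_t loadInt32(const unsigned char *p)
{
    int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

    With this helper the map offset no longer has to be a multiple of 4 for the reads to be safe (endianness is still the host's, as elsewhere in this thread).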


  • Moderators

    @JonB
    Really? And what guarantees that memory[i*Nb+3] will be part of the valid memory?

    I assume this is what the OP wants to do:

    QVector<uchar> FFID(N*4); -> QVector<qint32> FFID(N);
    uchar *memory -> qint32 *memory
    
    and 
    for(qint64 i = 0; i < N; i++){
                FFID[i] = memory[i*Nb];
            }
    


  • @jsulm thank you, that works!
    @JonB @J-Hilk I think I see what you are discussing and I will keep it in mind.
    If I map a part of the file whose size is not a multiple of 4 (like in the code below), my program doesn't output any error to the command line. The compiler says it was successfully built, the application output shows that it started, and one second later it is terminated.

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    //#include <QIODevice>
    #include <armadillo>
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"C:/Users/tasik/Documents/Qt_Projects/raw_le.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
             //handle error
        }
        //qint32 *memory = new qint32;
        //(uchar*)&memory;
        uchar* memory = file.map(3608, file.size()-3607); // Here the mappable part file.size()-3607 has some remainder of the division by 4 
        (qint32*) memory;
        if (memory) {
            std::cout << "started..." << std::endl;
            wall_clock timer;
            qint64 fSize = file.size();
            qint64 N = 44861;
            qint64 Nb = 661*4;
            QVector<qint32> FFID(N);
            (uchar *)&FFID;
            timer.tic();
            for(qint64 i = 0; i < N; i++){
                FFID[i] = memory[i*Nb];
                /*FFID[i+1] = memory[i*Nb+1];
                FFID[i+2] = memory[i*Nb+2];
                FFID[i+3] = memory[i*Nb+3];*/
                std::cout << FFID[i] << std::endl;
            }
            double n0 = timer.toc();
            std::cout << n0 << std::endl;
            std::cout << "finished!" << std::endl;
        }
    }
    


  • @Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:

    and application output throws that it started and one second later it is terminated.

    Yes, that was my point. You won't get a compilation error. You would get a run-time "crash" on something like the line FFID[i] = memory[i*Nb];. Under Linux you'd get a core dump (if enabled); under Windoze I don't know, but I would have thought it would bring up a message box of some kind.

    However (I haven't got much time), I don't think the code you've written reflects this. For a start, the statements (qint32*) memory; and (uchar *)&FFID; are no-ops (turn the compiler warning level up and you might get a "no effect" warning for those lines; you should always develop with the highest warning level you can). You haven't changed memory over to qint32*; what you seem to think is how to do casts is wrong. This is C/C++ stuff. You'll want something more like

    qint32* memory = static_cast<qint32*>(file.map(3608, file.size()-3607));

    qint32* memory = reinterpret_cast<qint32*>(file.map(3608, file.size()-3607)); 
    

    but I haven't got time to sort this out fully. And if you do that, you need to understand how to index it afterwards; the offsets won't be the same as when it was uchar*. Don't try to change to qint32* for your accesses if you don't know what you're doing cast-wise in C/C++! :)



  • @JonB said in Fastest way to read part of 300 Gigabyte binary file:

    qint32* memory = static_cast<qint32*>(file.map(3608, file.size()-3607));

    thank you but this sends me an error:

    main.cpp:17:22: error: static_cast from 'uchar *' (aka 'unsigned char *') to 'qint32 *' (aka 'int *') is not allowed
    

  • Moderators

    @Please_Help_me_D
    @JonB meant to write reinterpret_cast, not static_cast. There are few uses for reinterpret_cast, but this is one :)



  • @J-Hilk ok, now it works :)



  • @SGaist said in Fastest way to read part of 300 Gigabyte binary file:

    Did you consider mapping only the parts that are pertinent to what you want to read ?

    I don't know how to do that, but I saw something like this in the Boost C++ documentation, where it is written:
    What is a memory mapped file?
    File mapping is the association of a file's contents with a portion of the address space of a process. The system creates a file mapping to associate the file and the address space of the process. A mapped region is the portion of address space that the process uses to access the file's contents. A single file mapping can have several mapped regions, so that the user can associate parts of the file with the address space of the process without mapping the entire file in the address space, since the file can be bigger than the whole address space of the process (a 9GB DVD image file in a usual 32 bit systems). Processes read from and write to the file using pointers, just like with dynamic memory.

    Maybe if I mapped only the regions of the file that I need to read, it would speed up my application? Does Qt provide something like that?


  • Lifetime Qt Champion

    Well... As already said, the map function takes an offset into your file and a size, so you can map several regions of it with that. It's nowhere written that you have to pass an offset of zero and the full file size.



  • @SGaist I understand that I have offset and size parameters, and at the moment I use them as single values. If I want to map several regions of the file then I should use multiple offsets and multiple sizes, but the example below doesn't work:

        qint64 offset[] = {100, 200, 300};
        qint64 size[] = {4, 4, 4};
        qint32* memory = reinterpret_cast<qint32*>(file.map(offset, size));
    

    The error I get is:
    main.cpp:19:57: error: cannot initialize a parameter of type 'qint64' (aka 'long long') with an lvalue of type 'qint64 [3]'
    qfiledevice.h:127:23: note: passing argument to parameter 'offset' here


  • Lifetime Qt Champion

    You can't just replace an input type by an array of the same type; that's not how it works. And in any case, the returned value of map is the address you'll have to pass to the unmap function.

    You won't avoid using a loop in one form or another.
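    Such a loop, hidden in a helper, might look like this (plain C++ on a raw byte buffer; the helper name is invented, and this is the gather the OP's FFID[ind0] = memory[ind1] idea amounts to):

```cpp
#include <cstddef>
#include <vector>

// There is no built-in whole-array gather like FFID[ind0] = memory[ind1]
// for raw buffers in C++ or Qt; the idiomatic equivalent is a short loop.
// `src` is e.g. the mapped byte buffer, `idx` the byte positions to pick.
std::vector<unsigned char> gather(const unsigned char *src, const std::vector<std::size_t> &idx)
{
    std::vector<unsigned char> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = src[idx[i]];
    return out;
}
```

    The loop is cheap next to the I/O; on a mapped file the cost is in the page faults the accesses trigger, not in the loop itself.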



  • @Please_Help_me_D
    As @SGaist has said. You will need to make multiple calls to QFileDevice::map(), one for each of the distinct regions you want mapped.



  • @SGaist when I heard the word "loop" I finally got the idea :)
    Here is my code:

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    #include <armadillo>   // for wall_clock
    using namespace arma;
    
    int main()
    {
        char segyFile[]{"D:/STACK1_PRESTM.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
            std::cout << "Error opening segy-file!" << std::endl;
            return 1;
        }
        qint64 N = 1734480;   // number of values to read
        qint64 Nb = 2059*4;   // byte stride between values
        //qint32* memory = reinterpret_cast<qint32*>(file.map(3608, file.size()-3608));
        QVector<qint32> FFID(N);
        std::cout << "started..." << std::endl;
        wall_clock timer;
        timer.tic();
        for (qint64 i = 0; i < N; i++){
            // map just the 4 bytes of the i-th value (not 1 byte, since a qint32 is read)
            uchar *region = file.map(3600+i*Nb, 4);
            FFID[i] = *reinterpret_cast<qint32*>(region);
            file.unmap(region); // release each small mapping again
            //std::cout << FFID[i] << std::endl;
        }
        double n0 = timer.toc();
        std::cout << n0 << std::endl;
        std::cout << "finished!" << std::endl;
    }
    
    

    Is it possible to store all the values I need in memory at once? At the moment I only have a pointer to a single value in the variable memory; that way I could avoid assigning the values to FFID.
    The timing results are almost the same:
    SSD internal

    • QFile::map 97 Seconds (previously it was 86)

    HDD internal

    • QFile::map 223 Seconds (previously it was 216)

    To check the reliability of the results I also repeated the experiments with whole-file mapping, as before, and the timings are the same. So there is no big difference between mapping the whole file and mapping many regions of it.

