Fastest way to read part of 300 Gigabyte binary file

Scheduled Pinned Locked Moved Solved C++ Gurus
58 Posts 7 Posters 13.7k Views 5 Watching
    Please_Help_me_D
    wrote on last edited by
    #35

    @SGaist said in Fastest way to read part of 300 Gigabyte binary file:

    Did you consider mapping only the parts that are pertinent to what you want to read ?

    I don't know how to do that, but I saw something like this in the Boost C++ documentation. It says:
    What is a memory mapped file?
    File mapping is the association of a file's contents with a portion of the address space of a process. The system creates a file mapping to associate the file and the address space of the process. A mapped region is the portion of address space that the process uses to access the file's contents. A single file mapping can have several mapped regions, so that the user can associate parts of the file with the address space of the process without mapping the entire file in the address space, since the file can be bigger than the whole address space of the process (a 9GB DVD image file in a usual 32 bit systems). Processes read from and write to the file using pointers, just like with dynamic memory.

    Maybe if I could map only regions of my file that I need to read then it would speed up my application? Does Qt provide something like that?

      SGaist
      Lifetime Qt Champion
      wrote on last edited by
      #36

      Well... As already said, the map function takes an offset into your file and a size, so you can map several regions of it with that. Nowhere is it written that you have to pass an offset of zero and the full file size.

      Interested in AI ? www.idiap.ch
      Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

        Please_Help_me_D
        wrote on last edited by
        #37

        @SGaist I understand that I have offset and size parameters, and I currently pass a single value for each. If I want to map several regions of a file then I should use multiple offsets and multiple sizes, but the example below doesn't work:

            qint64 offset[] = {100, 200, 300};
            qint64 size[] = {4, 4, 4};
            qint32* memory = reinterpret_cast<qint32*>(file.map(offset, size));
        

        The error I get is:
        main.cpp:19:57: error: cannot initialize a parameter of type 'qint64' (aka 'long long') with an lvalue of type 'qint64 [3]'
        qfiledevice.h:127:23: note: passing argument to parameter 'offset' here

          SGaist
          Lifetime Qt Champion
          wrote on last edited by
          #38

          You can't just replace an input type by an array of the same type. That's not how it works. And in any case, the value returned by map is the address you'll have to pass to the unmap function.

          You won't avoid using a loop of one form or another.

            JonB
            wrote on last edited by
            #39

            @Please_Help_me_D
            As @SGaist has said. You will need to make multiple calls to QFileDevice::map(), one for each of the distinct regions you want mapped.
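A minimal sketch of the bookkeeping behind "one call per region", written in standard C++ rather than Qt. The helper name `regionOffsets` and its parameters are hypothetical; each returned offset is what would be handed to `QFileDevice::map()` together with the region size, and each pointer that call returns would later go to `unmap()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte offset of each region, assuming the regions are
// `strideBytes` apart and the first one starts at `firstOffset`.
std::vector<std::int64_t> regionOffsets(std::int64_t firstOffset,
                                        std::int64_t strideBytes,
                                        std::int64_t count) {
    assert(firstOffset >= 0 && strideBytes > 0 && count >= 0);
    std::vector<std::int64_t> offsets;
    offsets.reserve(static_cast<std::size_t>(count));
    for (std::int64_t i = 0; i < count; ++i) {
        offsets.push_back(firstOffset + i * strideBytes);
    }
    return offsets;
}
```

With the values from the failed attempt above, `regionOffsets(100, 100, 3)` yields 100, 200 and 300; a loop would then map 4 bytes at each of those offsets.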

              Please_Help_me_D
              wrote on last edited by
              #40

              @SGaist when I heard the word "loop" I finally got the idea :)
              Here is my code:

              #include <QCoreApplication>
              #include <QFile>
              #include <QVector>
              //#include <QIODevice>
              #include <armadillo>
              using namespace arma;
              
              int main()
              {
                  char segyFile[]{"D:/STACK1_PRESTM.sgy"};
                  QFile file(segyFile);
                  qint64 fSize = file.size();
                  qint64 N = 1734480;
                  qint64 Nb = 2059*4;
                  if (!file.open(QIODevice::ReadOnly)) {
                       //handle error
                  }
                  //qint32* memory = reinterpret_cast<qint32*>(file.map(3608, file.size()-3608));
                  qint32* memory = new qint32;
                  QVector<qint32> FFID(N);
                  std::cout << "started..." << std::endl;
                  wall_clock timer;
                  timer.tic();
                  for (int i = 0; i < N; i++){
                      memory = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
                      FFID[i] = *memory;
                      //std::cout << *memory << std::endl;
                  }
                  double n0 = timer.toc();
                  std::cout << n0 << std::endl;
                  std::cout << "finished!" << std::endl;
              }
              
              

              Is it possible to store in memory all the values that I need? Right now I only have a pointer to a single value in the variable memory. Then I could avoid assigning values to FFID.
              The timing results are almost the same:
              SSD internal

              • QFile::map 97 Seconds (previously it was 86)

              HDD internal

              • QFile::map 223 Seconds (previously it was 216)

              To check the reliability of the results I also repeated the experiment with whole-file mapping as I did before, and the timings are the same. So there is no big difference between mapping the whole file and mapping many regions of it.

                SGaist
                Lifetime Qt Champion
                wrote on last edited by
                #41

                Is it possible to store in memory all the values that I need? Right now I only have a pointer to a single value in the variable memory. Then I could avoid assigning values to FFID.

                I am not sure I understand that question. memory is a pointer to the start of the region you mapped. In any case, you are still not un-mapping anything in your code which is a bad idea.

                In order to be able to answer your question, please explain what you are going to do with the values you want to retrieve from that file.

                  Please_Help_me_D
                  wrote on last edited by
                  #42

                  @SGaist I forgot to unmap...
                  Here is how I get an array (or vector) of values in FFID:

                      for (int i = 0; i < N; i++){
                          memory = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
                          FFID[i] = *memory;
                          //std::cout << *memory << std::endl;
                      }
                  

                  I do that because memory is a pointer to a single value. If I could write something like:

                      for (int i = 0; i < N; i++){
                          memory[i] = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
                      }
                  

                  then I could avoid using FFID.
                  It doesn't make much sense in this case, but I'm just interested, and maybe this information will be useful in the future in other situations.

                    jsulm
                    Lifetime Qt Champion
                    wrote on last edited by jsulm
                    #43

                    @Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:

                    memory is a pointer to a single value

                    No. It is a pointer to a chunk of memory (bytes), and you can interpret that memory as you like. You can use memory as an array (in C/C++ an array decays to a pointer to its first element):

                    FFID[i] = memory[i];
                    

                    So, there is really no need to map inside the loop.

                      JonB
                      wrote on last edited by JonB
                      #44

                      @jsulm
                      I'm afraid this is not what he means/how he is using memory. There are quite distinct, separate, non-contiguous areas of his memory-mapped file he wishes to access. He wishes to call QFile::map() many times, each one mapping a separate area of memory. He will need to retain those mapped addresses so that he can later unmap() them.

                      He should change to an array/list of memoryMappings. I'm not a C++ expert, but his code should be more like:

                      QVector<qint32*> memoryMappings(N);
                      for (int i = 0; i < N; i++){
                          memoryMappings[i] = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
                          FFID[i] = *memoryMappings[i];
                      }
                      
                        Please_Help_me_D
                        wrote on last edited by
                        #45

                        @JonB yes, that was exactly what I wanted!
                        Despite my humble knowledge of C/C++ programming I got an idea :)
                        Suppose I map the address of the first value that I want to read (bytes 3600-3604). Then calling:

                        memory
                        

                        would show me the address of that value. My file is stored contiguously on the disk, so the second qint32 value has to be at the (memory+4) address. So if I call:

                        first_value = *memory;
                        second_value = *(memory+4);
                        third_value = *(memory+8);
                        

                        Should this work? Would it be faster? I'm going to try
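One detail worth checking before trying this: pointer arithmetic on a `qint32*` counts in elements, not bytes, so `memory + 1` is already 4 bytes further along and `memory + 4` is 16 bytes further. A small standard C++ sketch of that rule, using `std::int32_t` in place of `qint32` (the function name `byteStep` is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Distance in bytes between p and p + n for a pointer to 32-bit integers.
// The caller must make sure p + n stays inside (or one past) the same array.
std::ptrdiff_t byteStep(const std::int32_t* p, std::ptrdiff_t n) {
    assert(p != nullptr);
    const auto* first = reinterpret_cast<const unsigned char*>(p);
    const auto* last = reinterpret_cast<const unsigned char*>(p + n);
    return last - first;
}
```

For any `std::int32_t buf[8]`, `byteStep(buf, 1)` is 4 and `byteStep(buf, 4)` is 16. Either way, reads are only valid inside the region that was actually mapped.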

                          JonB
                          wrote on last edited by
                          #46

                          @Please_Help_me_D
                          Huh? Do you mean you are intending to change the physical file content/format to move the data points you want to retrieve so that they are contiguous? Seems pretty surprising to me, one would assume the format is dictated by something else external to your program. But then you never have explained what this data/file is all about....

                            Please_Help_me_D
                            wrote on last edited by
                            #47

                            @JonB no I don't want to change the content of a file. My file is like the following:

                            • The first 3600 bytes describe the rest of the file. Here I get information about how many rows Nb and columns N I have.

                            • The rest of the file is a block of Nb bytes repeated N times. We can represent this part as a matrix with Nb rows (or bytes if we multiply it by 4) and N columns, and my task is to read a single row of this matrix. In other words, I need to read every Nb-th byte starting from some byte (say 3600 or 3604 or something).
                              Actually it is a little bit more complicated: some rows of this "matrix" are qint16, others are qint32 and single.
                              Here is what I do, and I get the correct values for the first few qint32 rows:

                                qint64 N = 44861;
                                qint64 Nb = 100;
                                memory = reinterpret_cast<qint32*>(file.map(3600, 4));
                                for (int i = 0; i < N; i++){
                                    std::cout << memory+i << std::endl; //  address
                                    std::cout << *(memory+i) << std::endl; // value
                                }
                            

                            But my program breaks when I try:

                                qint64 N = 44861;
                                qint64 Nb = 100;
                                memory = reinterpret_cast<qint32*>(file.map(3600, 4));
                                for (int i = 0; i < N; i++){
                                    std::cout << memory+i*Nb << std::endl;
                                    std::cout << *(memory+i*Nb) << std::endl;
                                }
                            

                            Application output:
                            15:54:06: C:\Users\tasik\Documents\Qt_Projects\build-untitled1-Desktop_Qt_5_12_6_MSVC2017_64_bit-Release\release\untitled1.exe starts ...
                            15:54:09: C:\Users\tasik\Documents\Qt_Projects\build-untitled1-Desktop_Qt_5_12_6_MSVC2017_64_bit-Release\release\untitled1.exe completed with the code -1073741819
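For reference, the exit code -1073741819 is 0xC0000005, the Windows access-violation status, which fits a read outside the mapped region. Note also that `memory + i*Nb` on a `qint32*` advances by `4*Nb` bytes per step, not `Nb` bytes. The sketch below shows a strided read with an explicit bounds check over an in-memory byte buffer. It is plain C++, the name `gatherInt32` is hypothetical, and the buffer stands in for whatever region was mapped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy one 32-bit value every `strideBytes` bytes, starting at `firstOffset`,
// and stop as soon as a read would run past the end of the buffer.
std::vector<std::int32_t> gatherInt32(const std::vector<unsigned char>& region,
                                      std::size_t firstOffset,
                                      std::size_t strideBytes,
                                      std::size_t count) {
    assert(strideBytes > 0);
    std::vector<std::int32_t> values;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t pos = firstOffset + i * strideBytes;
        if (pos > region.size() || region.size() - pos < sizeof(std::int32_t)) {
            break;  // this read would fall outside the region
        }
        std::int32_t v;
        std::memcpy(&v, region.data() + pos, sizeof v);  // alignment-safe read
        values.push_back(v);
    }
    return values;
}
```

The mapped size therefore has to cover the last value wanted: at least `firstOffset + (count - 1) * strideBytes + 4` bytes from the start of the region.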

                              Please_Help_me_D
                              wrote on last edited by
                              #48

                              It seems to me that this works only for 124*4 bytes.
                              I just tested how many iterations complete before the program breaks for different Nb:

                                  for (int i = 0; i < N; i++){
                                      std::cout << *(memory+i*Nb) << std::endl;
                                  }
                              
                              • Nb = 1, max_iterator_i = 124
                              • Nb = 2, max_iterator_i = 62
                              • Nb = 4, max_iterator_i = 31
                                So I think that my idea is not as good as I thought :)
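These counts are consistent with the 4-byte mapping being backed by a whole memory page. Assuming a 4096-byte page (an assumption, not something stated in the thread), byte 3600 sits 496 bytes before the page end, which is room for exactly 124 qint32 values; reading every 2nd or 4th of those gives 62 or 31 reads. The hypothetical helper below reproduces that arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Number of 4-byte reads, taken every `strideElems` elements starting at
// byte `offset`, that still fit inside the page containing `offset`.
// The 4096-byte default page size is an assumption.
std::int64_t readsBeforePageEnd(std::int64_t offset,
                                std::int64_t strideElems,
                                std::int64_t pageSize = 4096) {
    assert(offset >= 0 && strideElems > 0 && pageSize > 0);
    const std::int64_t bytesLeft = pageSize - offset % pageSize;  // to page end
    const std::int64_t elemsLeft = bytesLeft / 4;                 // whole values
    return (elemsLeft + strideElems - 1) / strideElems;           // round up
}
```

Reads inside that page happen not to crash, but only the bytes requested from `map()` should be relied on.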
                                SGaist
                                Lifetime Qt Champion
                                wrote on last edited by
                                #49

                                @Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:

                                memory = reinterpret_cast<qint32*>(file.map(3600, 4));

                                You are mapping a region of 4 bytes yet trying to read way past that.

                                  JonB
                                  wrote on last edited by JonB
                                  #50

                                  @Please_Help_me_D
                                  I give up, I really don't understand what you think you are trying to achieve.

                                  If the data you want to fetch is physically separated all over the file, as you originally said and if that hasn't changed, you are wasting your time trying to miraculously "coalesce/adjacentise" the data access in memory via mapping. It is a vain attempt. Whichever way you look at it, if you have a physical hard disk it is going to have to seek/move the head to reach discontinuous areas. That is what will "take time", and there is nothing you can do about it.....

                                    Please_Help_me_D
                                    wrote on last edited by
                                    #51

                                    @SGaist yes thank you
                                    @JonB I was wrong. Thank you for the explanation

                                      Please_Help_me_D
                                      wrote on last edited by
                                      #52

                                      Hi all again,

                                      I just noticed one thing:
                                      if I iterate through a mapped file of size 14 gigabytes, I can see memory consumption eat 4 GB of RAM in about 10 seconds. After that I have to stop the execution because of my RAM limit, but it shows no sign that it is going to stop growing.

                                      For example this code produces all that I say on Windows 10 x64, Qt 5.14.0, MSVC 64 2017:

                                          qFile = new QFile("myBigFile");
                                          uchar* memFile_uchar = qFile->map(0, qFile->size());
                                          int val;
                                          size_t I = qFile->size();
                                          for(size_t i = 0; i < I; i++){
                                              val = memFile_uchar[i];
                                          }
                                      

                                      Hope somebody is able to explain that...

                                      PS: When I was using Matlab and its memory mapping technique I saw similar behaviour there.

                                        JonB
                                        wrote on last edited by
                                        #53

                                        @Please_Help_me_D
                                        I'm not sure what you're asking here. You are mapping the whole of the file. As you begin to access data in the mapped area it gets brought into memory, and that takes up space. If you have limited memory, this is not a good idea.

                                        I haven't used memory mapping myself, but presumably if you want to keep memory usage down you have to do something like only map partial areas of the file at a time (arguments to map()), and release previously mapped areas (unmap()). You'd have to test whether that actually results in less memory usage.

                                        If you are limited in memory compared to the size of the file, perhaps you shouldn't be using memory mapping at all. File seeking to desired data won't have a memory overhead. In the code you show you are reading the data just once, so there may not be much difference. Have you actually measured performance with file versus memory-map access?
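As a point of comparison for the file-seeking alternative, here is a minimal standard C++ sketch that reads one 32-bit value every `strideBytes` bytes using `std::ifstream`. The function name `readStrided` and its parameters are illustrative; the same loop could presumably be written with `QFile::seek()` and `QFile::read()`. Only the requested values are kept in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Read `count` 32-bit values, one every `strideBytes` bytes, starting at
// `firstOffset`. Stops early if the file cannot be opened or a read fails.
std::vector<std::int32_t> readStrided(const std::string& path,
                                      std::int64_t firstOffset,
                                      std::int64_t strideBytes,
                                      std::int64_t count) {
    assert(firstOffset >= 0 && strideBytes > 0 && count >= 0);
    std::vector<std::int32_t> values;
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return values;  // could not open the file
    }
    values.reserve(static_cast<std::size_t>(count));
    for (std::int64_t i = 0; i < count; ++i) {
        in.seekg(static_cast<std::streamoff>(firstOffset + i * strideBytes),
                 std::ios::beg);
        char raw[sizeof(std::int32_t)];
        if (!in.read(raw, sizeof raw)) {
            break;  // past the end of the file, or an I/O error
        }
        std::int32_t v;
        std::memcpy(&v, raw, sizeof v);  // bytes taken in the file's order
        values.push_back(v);
    }
    return values;
}
```

Whether this beats mapping for a given file and disk is something to measure rather than assume.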

                                          Please_Help_me_D
                                          wrote on last edited by
                                          #54

                                          @JonB said in Fastest way to read part of 300 Gigabyte binary file:

                                          I'm not sure what you're asking here. You are mapping the whole of the file. As you begin to access data in the mapped area it gets brought into memory, and that takes up space. If you have limited memory, this is not a good idea.

                                          Well, this helped me. So I divide my file into portions and unmap() those portions when they are no longer needed. In this case there is no such memory consumption.
                                          Thank you!
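A rough sketch of the "divide into portions" bookkeeping in plain C++. `Window` and `planWindows` are hypothetical names; each window's offset and size would be passed to `map()`, processed, and then released with `unmap()` before moving on to the next one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous portion of the file: where it starts and how long it is.
struct Window {
    std::int64_t offset;
    std::int64_t size;
};

// Split a file of `fileSize` bytes into consecutive windows of at most
// `windowSize` bytes; the last window may be shorter.
std::vector<Window> planWindows(std::int64_t fileSize, std::int64_t windowSize) {
    assert(fileSize >= 0 && windowSize > 0);
    std::vector<Window> windows;
    for (std::int64_t off = 0; off < fileSize; off += windowSize) {
        windows.push_back({off, std::min(windowSize, fileSize - off)});
    }
    return windows;
}
```

For example, `planWindows(10, 4)` gives three windows: (0, 4), (4, 4) and (8, 2). The window size is a tuning knob to pick by measurement.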
