Fastest way to read part of 300 Gigabyte binary file
-
@J-Hilk
Umm, no, I don't see that. His current uchar *memory means it's only picking up bytes from there. And he made his FFID be QVector<uchar>. So he is copying one byte at a time (which is what I think he wants to get rid of), and the current code won't have an odd-boundary memory-alignment issue. But new code with qint32* in place of uchar* could have a problem....
If his offset is always like the example, 7996, so that it is always divisible by 4, then I would guess the result returned from QFile::map() will not show any problem. This is an issue which does not arise when reading numbers from a file, only when mapping, so just be aware of it.
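A minimal sketch of one way to sidestep the alignment concern, written in plain C++ without Qt: copying the four bytes with std::memcpy works at any byte offset, whereas dereferencing a qint32* that points at an odd address may not be safe on every platform. The name readInt32At and the plain unsigned char buffer are assumptions for illustration; in the thread's program the buffer would be the uchar* returned by QFile::map(). The value is read in the machine's native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads a 32-bit integer stored at an arbitrary byte offset of a buffer.
// std::memcpy has no alignment requirement, so the offset does not need
// to be a multiple of 4.
std::int32_t readInt32At(const unsigned char* bytes, std::size_t size, std::size_t offset)
{
    assert(offset + sizeof(std::int32_t) <= size); // the read must stay inside the buffer
    std::int32_t value;
    std::memcpy(&value, bytes + offset, sizeof value);
    return value;
}
```

Compilers typically turn such a fixed-size memcpy into a single load, so this is usually no slower than the cast.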
-
@jsulm thank you, that works!
@JonB @J-Hilk I think I see what you are discussing, and I will keep it in mind.
If I map a part of the file whose size is not a multiple of 4 (as in the code below), my program doesn't output any error or anything on the command line. The compiler says the build succeeded, and the application output shows that the program started and was terminated one second later.

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    //#include <QIODevice>
    #include <armadillo>
    using namespace arma;

    int main()
    {
        char segyFile[]{"C:/Users/tasik/Documents/Qt_Projects/raw_le.sgy"};
        QFile file(segyFile);
        if (!file.open(QIODevice::ReadOnly)) {
            //handle error
        }
        //qint32 *memory = new qint32;
        //(uchar*)&memory;
        uchar* memory = file.map(3608, file.size()-3607); // Here the mappable part file.size()-3607 has some remainder of the division by 4
        (qint32*) memory;
        if (memory) {
            std::cout << "started..." << std::endl;
            wall_clock timer;
            qint64 fSize = file.size();
            qint64 N = 44861;
            qint64 Nb = 661*4;
            QVector<qint32> FFID(N);
            (uchar *)&FFID;
            timer.tic();
            for(qint64 i = 0; i < N; i++){
                FFID[i] = memory[i*Nb];
                /*FFID[i+1] = memory[i*Nb+1];
                FFID[i+2] = memory[i*Nb+2];
                FFID[i+3] = memory[i*Nb+3];*/
                std::cout << FFID[i] << std::endl;
            }
            double n0 = timer.toc();
            std::cout << n0 << std::endl;
            std::cout << "finished!" << std::endl;
        }
    }
-
@Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:
and application output throws that it started and one second later it is terminated.
Yes, that was my point. You won't get a compilation error. You would get a run-time "crash" on something like the line
FFID[i] = memory[i*Nb];
Under Linux you'd get a core dump (if enabled); under Windoze I don't know, but I would have thought it would bring up a message box of some kind.
However, I haven't got time, and I don't think the code you've written reflects this. For a start, the statements
(qint32*) memory;
and
(uchar *)&FFID;
are no-ops (turn the compiler warning level up and you might get a "no effect" warning for these lines; you should always develop with the highest warning level you can). You haven't changed memory over to qint32*, and what you seem to think is how to do casts is wrong. This is C/C++ stuff. You'll want something more like
qint32* memory = static_cast<qint32*>(file.map(3608, file.size()-3607));
qint32* memory = reinterpret_cast<qint32*>(file.map(3608, file.size()-3607));
but I haven't got time to sort you out. And if you do that, you need to understand how to then index it; it won't be the same offsets as you used when it was uchar*. Don't try to change to qint32* for your accesses if you don't know what you're doing cast-wise in C/C++! :)
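The indexing point can be sketched in plain C++ as follows. The helper names are hypothetical, and recordBytes stands for the record length in bytes (Nb in the thread). Once the buffer is viewed as 32-bit integers, an index counts 4-byte elements, so a byte offset has to be divided by 4 before it is used as an index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts a byte offset into an element index for an int32 view of the same
// memory. Only meaningful when the byte offset is a multiple of 4.
std::size_t int32IndexFromByteOffset(std::size_t byteOffset)
{
    assert(byteOffset % sizeof(std::int32_t) == 0);
    return byteOffset / sizeof(std::int32_t);
}

// Reads the first 4-byte value of record i. The result of the cast is stored
// in "words"; a cast written as a statement of its own has no effect.
std::int32_t firstValueOfRecord(const unsigned char* bytes, std::size_t i,
                                std::size_t recordBytes)
{
    const std::int32_t* words = reinterpret_cast<const std::int32_t*>(bytes);
    return words[int32IndexFromByteOffset(i * recordBytes)];
}
```

With the uchar* view, the same record starts at bytes[i * recordBytes], and that expression yields only a single byte.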
-
@JonB said in Fastest way to read part of 300 Gigabyte binary file:
qint32* memory = static_cast<qint32*>(file.map(3608, file.size()-3607));
Thank you, but this gives me an error:
main.cpp:17:22: error: static_cast from 'uchar *' (aka 'unsigned char *') to 'qint32 *' (aka 'int *') is not allowed
-
@Please_Help_me_D
@JonB meant to write reinterpret_cast, not static_cast. There are few uses for reinterpret_cast, but this is one :)
-
@J-Hilk ok, now it works :)
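For reference, a small plain C++ sketch of why the static_cast version was rejected. static_cast does not convert between unrelated object pointer types such as unsigned char* and int*, while reinterpret_cast does. Chaining two static_casts through void* is a second accepted spelling. The function names here are made up for illustration.

```cpp
#include <cstdint>

// static_cast<const std::int32_t*>(bytes) would not compile here:
// unsigned char* and std::int32_t* are unrelated pointer types.

// Option 1: reinterpret_cast, as suggested in the thread.
const std::int32_t* viewWithReinterpretCast(const unsigned char* bytes)
{
    return reinterpret_cast<const std::int32_t*>(bytes);
}

// Option 2: two static_casts through const void*, giving the same address.
const std::int32_t* viewThroughVoid(const unsigned char* bytes)
{
    return static_cast<const std::int32_t*>(static_cast<const void*>(bytes));
}
```

Neither form checks alignment or size, so both are only as safe as the buffer behind the pointer.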
-
@SGaist said in Fastest way to read part of 300 Gigabyte binary file:
Did you consider mapping only the parts that are pertinent to what you want to read ?
I don't know how to do that, but I saw something like this in the Boost C++ documentation. It says:
What is a memory mapped file?
File mapping is the association of a file's contents with a portion of the address space of a process. The system creates a file mapping to associate the file and the address space of the process. A mapped region is the portion of address space that the process uses to access the file's contents. A single file mapping can have several mapped regions, so that the user can associate parts of the file with the address space of the process without mapping the entire file in the address space, since the file can be bigger than the whole address space of the process (a 9GB DVD image file in a usual 32 bit systems). Processes read from and write to the file using pointers, just like with dynamic memory.
Maybe if I could map only the regions of my file that I need to read, it would speed up my application? Does Qt provide something like that?
-
Well... As already said, the map function takes an offset in your file and a size, so you can map several regions of it with that. Nowhere is it written that you have to pass an offset of zero and the full file size.
-
@SGaist I understand that I have offset and size parameters, and I actually use them as single-valued numbers. If I want to map several regions of a file, then I should use multiple offsets and multiple sizes, but the example below doesn't work:

    qint64 offset[] = {100, 200, 300};
    qint64 size[] = {4, 4, 4};
    qint32* memory = reinterpret_cast<qint32*>(file.map(offset, size));

The error I get is:
main.cpp:19:57: error: cannot initialize a parameter of type 'qint64' (aka 'long long') with an lvalue of type 'qint64 [3]'
qfiledevice.h:127:23: note: passing argument to parameter 'offset' here
-
You can't just replace an input type by an array of the same type. That's not how it works. And in any case, the value returned by map is the address you'll have to pass to the unmap function.
You won't avoid using a loop of one form or another.
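As a rough sketch of the loop idea, kept free of Qt so it stands alone: the regions are computed one by one, each as a single offset and a single size. Region and planRegions are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One region of the file: where it starts and how many bytes it covers.
struct Region {
    std::int64_t offset;
    std::int64_t size;
};

// Builds the list of equally spaced regions that a loop would map one by one.
std::vector<Region> planRegions(std::int64_t firstOffset, std::int64_t stride,
                                std::int64_t count, std::int64_t regionSize)
{
    assert(stride > 0 && count >= 0 && regionSize > 0);
    std::vector<Region> regions;
    regions.reserve(static_cast<std::size_t>(count));
    for (std::int64_t i = 0; i < count; ++i) {
        regions.push_back({firstOffset + i * stride, regionSize});
    }
    return regions;
}
```

In the Qt program, each region would be passed to file.map(region.offset, region.size) inside the loop, the value copied out, and the returned pointer handed to file.unmap() before moving on.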
-
@SGaist when I heard the word "loop", I finally got the idea :)
Here is my code:

    #include <QCoreApplication>
    #include <QFile>
    #include <QVector>
    //#include <QIODevice>
    #include <armadillo>
    using namespace arma;

    int main()
    {
        char segyFile[]{"D:/STACK1_PRESTM.sgy"};
        QFile file(segyFile);
        qint64 fSize = file.size();
        qint64 N = 1734480;
        qint64 Nb = 2059*4;
        if (!file.open(QIODevice::ReadOnly)) {
            //handle error
        }
        //qint32* memory = reinterpret_cast<qint32*>(file.map(3608, file.size()-3608));
        qint32* memory = new qint32;
        QVector<qint32> FFID(N);
        std::cout << "started..." << std::endl;
        wall_clock timer;
        timer.tic();
        for (int i = 0; i < N; i++){
            memory = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
            FFID[i] = *memory;
            //std::cout << *memory << std::endl;
        }
        double n0 = timer.toc();
        std::cout << n0 << std::endl;
        std::cout << "finished!" << std::endl;
    }

Is it possible to store in memory all the values that I need? Now I only have a pointer to a single value in the variable memory. Then I could avoid assigning values to FFID.
The timing result is almost the same:
SSD internal
- QFile::map 97 seconds (previously it was 86)
HDD internal
- QFile::map 223 seconds (previously it was 216)
To check the reliability of the results, I also repeated the experiment with whole-file mapping as I did before, and the timings are the same. So there is no big difference between mapping the whole file and mapping many regions of it.
-
Is it possible to store in memory all the values that I need? Now I only have a pointer to a single value in the variable memory. Then I could avoid assigning values to FFID.
I am not sure I understand that question. memory is a pointer to the start of the region you mapped. In any case, you are still not unmapping anything in your code, which is a bad idea.
In order to be able to answer your question, please explain what you are going to do with the values you want to retrieve from that file.
-
@SGaist I forgot to unmap...
Here, to get an array (or vector) of values in FFID, I use:

    for (int i = 0; i < N; i++){
        memory = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
        FFID[i] = *memory;
        //std::cout << *memory << std::endl;
    }

I do that because memory is a pointer to a single value. If I could write something like:

    for (int i = 0; i < N; i++){
        memory[i] = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
    }

then I could avoid using FFID.
It doesn't make much sense in this case, but I'm just interested, and maybe this information will be useful in other situations in the future.
-
@Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:
memory is a pointer to a single value
No. It is a pointer to a chunk of memory (bytes), and you can interpret that memory as you like. You can use memory as an array (in C/C++ an array decays to a pointer to its first element):
FFID[i] = memory[i];
So, there is really no need to map inside the loop.
-
@jsulm
I'm afraid this is not what he means/how he is using memory. There are quite distinct, separate, non-contiguous areas of his memory-mapped file he wishes to access. He wishes to call QFile::map() many times, each one mapping a separate area of memory. He will need to retain those mapped addresses so that he can later unmap() them.
He should change to an array/list of memoryMappings. I'm not a C++ expert, but his code should be more like:

    QVector<qint32*> memoryMappings(N);
    for (int i = 0; i < N; i++){
        memoryMappings[i] = reinterpret_cast<qint32*>(file.map(3600+i*Nb, 1));
        FFID[i] = *memoryMappings[i];
    }
-
@JonB yes, that was exactly what I wanted!
Despite my humble knowledge of C/C++ programming, I got an idea :)
Suppose I map the address of the first value that I want to read (bytes 3600-3604). Then
memory
would give me the address of that value. My file is stored contiguously on the disk, so the second qint32 value has to be at the address (memory+4). So if I call:

    first_value = *memory;
    second_value = *(memory+4);
    third_value = *(memory+8);

Should this work? Would it be faster? I'm going to try.
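One detail worth checking before trying this: pointer arithmetic on a qint32* counts elements, not bytes, so memory+4 is 16 bytes past memory, and the next 4-byte value is at memory+1. A plain C++ sketch, with byteDistance as a made-up helper name:

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes between two positions in an array of 32-bit integers.
std::ptrdiff_t byteDistance(const std::int32_t* from, const std::int32_t* to)
{
    return reinterpret_cast<const unsigned char*>(to)
         - reinterpret_cast<const unsigned char*>(from);
}
```

For an array v of std::int32_t, byteDistance(v, v + 1) is 4 and byteDistance(v, v + 4) is 16, and *(v + i) is the same as v[i].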
-
@Please_Help_me_D
Huh? Do you mean you are intending to change the physical file content/format to move the data points you want to retrieve so that they are contiguous? Seems pretty surprising to me; one would assume the format is dictated by something external to your program. But then you have never explained what this data/file is all about....
-
@JonB no, I don't want to change the content of the file. My file looks like this:
- The first 3600 bytes describe the rest of the file. From them I get the number of rows Nb and the number of columns N.
- The rest of the file is a block of Nb bytes repeated N times. We can represent this part as a matrix with Nb rows (or bytes, if we multiply by 4) and N columns, and my task is to read a single row of this matrix. In other words, I need to read every Nb-th byte, starting from some byte (say 3600 or 3604 or something).
Actually it is a little more complicated: some rows of this "matrix" are qint16, others are qint32 and single.
Here is what I do, and I get the correct values for the first few qint32 rows:

    qint64 N = 44861;
    qint64 Nb = 100;
    memory = reinterpret_cast<qint32*>(file.map(3600, 4));
    for (int i = 0; i < N; i++){
        std::cout << memory+i << std::endl; // address
        std::cout << *(memory+i) << std::endl; // value
    }

But my program breaks when I try:

    qint64 N = 44861;
    qint64 Nb = 100;
    memory = reinterpret_cast<qint32*>(file.map(3600, 4));
    for (int i = 0; i < N; i++){
        std::cout << memory+i*Nb << std::endl;
        std::cout << *(memory+i*Nb) << std::endl;
    }

Application output:
15:54:06: C:\Users\tasik\Documents\Qt_Projects\build-untitled1-Desktop_Qt_5_12_6_MSVC2017_64_bit-Release\release\untitled1.exe starts ...
15:54:09: C:\Users\tasik\Documents\Qt_Projects\build-untitled1-Desktop_Qt_5_12_6_MSVC2017_64_bit-Release\release\untitled1.exe completed with the code -1073741819
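The file layout described above (a fixed header followed by equally sized records) suggests a strided read. Below is a minimal sketch in plain C++ that assumes the whole data area is available as one byte buffer, for example a single large mapping. The name gatherField and its parameters are hypothetical, and values are read in native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Collects one 4-byte field from every record of a buffer laid out as
// [header][record 0][record 1]... where each record is recordBytes long.
std::vector<std::int32_t> gatherField(const unsigned char* data, std::size_t dataSize,
                                      std::size_t firstOffset, std::size_t recordBytes,
                                      std::size_t recordCount)
{
    std::vector<std::int32_t> values(recordCount);
    for (std::size_t i = 0; i < recordCount; ++i) {
        const std::size_t offset = firstOffset + i * recordBytes;
        assert(offset + sizeof(std::int32_t) <= dataSize); // stay inside the buffer
        std::memcpy(&values[i], data + offset, sizeof(std::int32_t));
    }
    return values;
}
```

The bounds check is the important part: every read must fall inside the region that was actually mapped.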
-
It seems to me that this works only for the first 124*4 bytes.
I tested how many iterations complete before the program breaks for different values of Nb:

    for (int i = 0; i < N; i++){
        std::cout << *(memory+i*Nb) << std::endl;
    }

- Nb = 1, max_iterator_i = 124
- Nb = 2, max_iterator_i = 62
- Nb = 4, max_iterator_i = 31
So I think my idea is not as good as I thought :)
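One possible reading of these numbers, which the thread does not confirm: if the operating system maps memory in whole pages of 4096 bytes (an assumption about this system), a 4-byte mapping at offset 3600 stays readable up to the end of its page at byte 4096. That leaves 4096 - 3600 = 496 bytes, or 124 values of 4 bytes. The hypothetical helper below reproduces the three counts under that assumption.

```cpp
#include <cassert>
#include <cstdint>

// Counts how many strided 4-byte reads start before the end of the page that
// contains mapOffset. pageSize = 4096 is an assumption, not a queried value.
std::int64_t readsBeforePageEnd(std::int64_t mapOffset, std::int64_t strideInInts,
                                std::int64_t pageSize = 4096)
{
    assert(mapOffset >= 0 && strideInInts > 0 && pageSize > 0);
    const std::int64_t bytesLeft = pageSize - (mapOffset % pageSize);
    const std::int64_t strideBytes = strideInInts * 4;
    // Read i starts at byte i * strideBytes; count the i with a start < bytesLeft.
    return (bytesLeft + strideBytes - 1) / strideBytes;
}
```

For mapOffset 3600, the function gives 124, 62 and 31 for strides of 1, 2 and 4, matching the counts reported above. Reads beyond the requested 4 bytes are still outside the mapping that was asked for, so relying on them is not safe.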
-
@Please_Help_me_D said in Fastest way to read part of 300 Gigabyte binary file:
memory = reinterpret_cast<qint32*>(file.map(3600, 4));
You are mapping a region of 4 bytes yet trying to read way past that.