Memory problem
-
Hey,
I am writing an application in which I read values from a CSV file. Each line is split at the commas and stored in a vector<QString>. Each line has 4 values. Then I store each line's vector in another vector. The CSV file has 5,795,857 lines, so in my structure I want to store 4 * 5,795,857 values. The problem is that the application crashes. As I see with the debugger, it crashes at approximately line 405,000. I know my computer is a little bit old, but I think it should be able to store this amount of values. I run Qt5 on Windows XP 32-bit and I have 1 GB of RAM. I am new to Qt and C++ programming, but as a Java developer, when I have such problems I increase the heap size. Do you think this is my problem? If yes, how can I increase the heap size in Qt5?
Thanks
-
Can you please post your code and the Call Stack that you see in the crash?
-
This is the function that is responsible for storing the data.
@std::vector< std::vector<QString> > Server::loadCsvFile( const char* path )
{
    vector<QString> temp;
    vector< vector<QString> > dataFlow;
    string dataString;
    QString row;
    ifstream dataFile( path );
    int stopCounter = 0;
    while (dataFile.good())
    {
        //stopCounter++;
        getline( dataFile, dataString );
        row = QString::fromStdString( dataString );
        //cout << "counter: " << stopCounter << "\n";
        QStringList rowList = row.split( "," );
        for( int i = 0; i < rowList.size(); i++ )
        {
            temp.push_back( rowList.at(i) );
        }
        dataFlow.push_back( temp );
        temp.clear();
        rowList.clear();
    }
    dataFile.close();
    dataFlow.pop_back(); // drop the empty row read after the last newline
    return dataFlow;
}
@
Actually it isn't crashing. After the function stores around 400,000 lines it gets really slow, and after a while the application says "not responding".
-
I am not sure you really need a nested vector. It looks like that is what is making your program really slow.
If the inner vector exists only to mark the end of each line, you can instead mark line boundaries by placing a special character (maybe a #) at the end of each line in one flat vector. This avoids the nested vector and could improve performance. Your loop would then look something like this (and of course your return type would change):
@
while (dataFile.good())
{
    //stopCounter++;
    getline( dataFile, dataString );
    row = QString::fromStdString( dataString );
    //cout << "counter: " << stopCounter << "\n";
    QStringList rowList = row.split( "," );
    for( int i = 0; i < rowList.size(); i++ )
    {
        temp.push_back( rowList.at(i) );
    }
    temp.push_back( "#" ); // to identify the end of line (EOL); do NOT clear temp here
    rowList.clear();
}
@
-
That's a good solution but i think it's not very practical for the rest of the application to access the values. Also that would make the application faster, but the memory storage problem still exists.
-
I can't see how this is Qt's problem when the only class you use from Qt is QString.
I think the cause of the problem is the vector, which reallocates memory every time you add new rows to it. Try switching to another container type, or allocate enough space in the vector (with reserve) at the start of the function.
-
Does your compiler support rvalue references (a C++11 feature)? If not, change to a compiler which supports it; this may boost your speed. If you are stuck with C++98, you can change the container to std::list<std::vector<QString> >. After the job is done, you could swap each std::vector<QString> back into a vector:
@
std::vector<std::vector<QString> > results(dataFlow.size());
size_t i = 0;
for(std::list<std::vector<QString> >::iterator it = dataFlow.begin(); it != dataFlow.end(); ++it, ++i){
    it->swap(results[i]); // std::list has no operator[], so iterate instead
}
@
In Qt5, you can enable the C++11 features by adding
@
CONFIG += c++11
@
I would expect that in C++11 the vector will move the inner vector<QString> objects into the new memory rather than copy them when it resizes; correct me if I am wrong, thanks. Besides, you could read the file with QFile; then you save the cost of converting std::string to QString:
@
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    return;

QTextStream in(&file);
while (!in.atEnd()) {
    QString line = in.readLine();
}
@
Here is an alternative with the help of C++11:
@std::vector<std::vector<QString> > Server::loadCsvFile( const char* path )
{
    vector<vector<QString> > dataFlow;
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        throw std::runtime_error("can't open the file " + std::string(path));

    QTextStream in(&file);
    while (!in.atEnd()) {
        QStringList rowList = in.readLine().split(",");
        int const SIZE = rowList.size();
        vector<QString> temp(SIZE); // give the vector enough room rather than push_back
        for(int i = 0; i != SIZE; ++i){
            temp[i].swap(rowList[i]);
        }
        dataFlow.emplace_back(std::move(temp)); // move temp into dataFlow rather than copy
    }
    return dataFlow;
}
@
I do have a question: why don't you just use std::vector<QStringList> as your data structure? This could save some headache:
@
std::vector<QStringList> Server::loadCsvFile( const char* path )
{
    vector<QStringList> dataFlow;
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        throw std::runtime_error("can't open the file " + std::string(path));

    QTextStream in(&file);
    while (!in.atEnd()) {
        dataFlow.emplace_back(in.readLine().split(","));
    }
    return dataFlow;
}
@
I strongly suggest you shift from C++98 (or C++03) to C++11; C++11 is far better than C++98 in both performance (because of rvalue references) and expressiveness.
-
Try using QLinkedList< QVector<QString> > as your data structure. This may already solve your issue. Also try to test your code in a release build, since a debug build may slow down your code with debug-information overhead.
-
I tried QLinkedList< QVector<QString> > and nothing changed.
I also modified the code as stereomatching suggested. I used the last suggestion and the problem still exists.
-
Are you sure the performance bottleneck is loadCsvFile?
The easiest way to test is to read a file with loadCsvFile without any other operations. If it really is the bottleneck, I still have one more solution, though a more complicated one:
1: Read the whole file into a single QString or std::string. std::string may be the better choice if you are running out of memory and don't need to deal with Unicode.
2: Design a class to record the position of every line within the std::string.
@
//declaration
std::string read_whole_file(char const *file_name, std::ios_base::openmode mode = std::ios::in | std::ios::binary);

//definition
std::string read_whole_file(char const *file_name, std::ios_base::openmode mode)
{
    std::ifstream file(file_name, mode);
    if(file.is_open()){
        file.seekg(0, std::ios_base::end);
        std::streampos const size = file.tellg();
        file.seekg(0, std::ios_base::beg);

        std::string result;
        result.reserve(size);
        result.append((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
        return result;
    }else{
        throw std::runtime_error(std::string("\nCan't open file : ") + file_name);
    }
}
@
For step 2, the class to save the position of each line could be stored in a std::vector<StringPosition>:
@
struct StringPosition
{
    StringPosition() : begin_(0), end_(0) {}
    StringPosition(std::string::size_type begin, std::string::size_type end) : begin_(begin), end_(end) {}

    std::string::size_type begin_;
    std::string::size_type end_;
};
@
@
std::string const results = read_whole_file("/Users/Qt/program/simpleCodes/test01/probability.hpp");
std::vector<StringPosition> str_line;
std::string::size_type pos, last_pos = 0;
while(true){
    pos = results.find_first_of('\n', last_pos);
    if(pos == std::string::npos){
        pos = results.length();
        str_line.emplace_back(last_pos, pos);
        break;
    }
    else{
        if(pos != last_pos)
            str_line.emplace_back(last_pos, pos);
    }
    last_pos = pos + 1;
}
@
I haven't tested this, it is just a simple idea; please refine the code yourself. If the function read_whole_file is too slow, you may consider using some C functions instead. If it is still too slow for you, maybe you have to consider other solutions like memory-mapped files. Since I have never tried memory-mapped files before (I haven't had the need), I can't give you any advice about them, sorry.
-
Ok, let's get through it:
4 * 5795857 = 23183428 items
Assuming every item has 4 characters:
23183428 * 8 * 4 = 741869696 bytes
741869696 / 1024 = 724482.5 KB
724482.5 / 1024 = 707.5 MB
707.5 / 1024 = 0.69 GB
For 8 characters per item it would be 1.38 GB of memory.
And this is only the raw data, without the overhead of the storage containers, etc.
So it's a sure thing that it becomes slow after a time if you only have 1 GB of RAM available on your machine, part of which is already used by the OS. So I guess you have ~600 MB effectively available.
If the RAM is full, memory has to be transferred to the hard disk, which is very expensive and thus slow. The same applies for the transfer back from the hard drive to the RAM when a process needs it.

How big is the file after all on the hard disk? Why do you need to hold the whole file in memory?
Since it's a more or less simple CSV file, and maybe it is indexed/structured (with an id?), I would suggest the following:
1. Traverse the whole CSV once. Do this with QFile.
2. On every 1000th line, for example, store the line number or id together with the "QFile::pos()":http://qt-project.org/doc/qt-4.8/qfile.html#pos in a list (e.g. QLinkedList).
3. Now, when you want to access/search an entry of the file, do it chunk-wise: use the stored pos with "QFile::seek()":http://qt-project.org/doc/qt-4.8/qfile.html#seek and read line by line from the file.
Hope this helps!
-
bq. How big is the file after all on the hard disk?
153MB
bq. Why do you need to hold the whole file in the memory?
I have an application that is going to use a server to fetch the data. At the moment the server is not available, and I want to simulate the server, so I store all the data in the data structure.
-
There are some errors in your calculation:
4 * 5795857 = 23183428 items
Assuming every item has 4 characters:
23183428 * 1 (it should be one byte per char) * 4 = 92733712 bytes
92733712 / 1024 = 90560 KB
90560 / 1024 = 88.4375 MB
88.4375 / 1024 = 0.0864 GB
Not that big.
Maybe you could transform the file to SQL format and then use C++ to read it?
-
[quote author="stereomatching" date="1369220484"]There are some errors in your calculation[/quote]
Right... thanks!
-
Before we try to optimize the algorithm for reading your CSV file, let's do a quick test to see if your computer can hold the data properly (but make sure that your file REALLY has 5 795 857 lines!):
Edit: Fixed code
@
QVector< QList<QByteArray> > testVector(5795857);
QFile file(path);
file.open(QFile::ReadOnly|QFile::Text);
for (int i = 0; i < 5795857; ++i)
testVector[i] = file.readLine().split(',');
file.close();
@
Some questions:
- Open the Task Manager (press Ctrl+Shift+Esc). How much RAM do you have free before you run your program? How much RAM does the program take up?
- Does it crash at line 1? If so, your memory is too fragmented to hold the data. Note: A linked list might be able to hold the data, but reading a linked list is very very slow (unless you only read from start to finish, never jump around, and never go back).
- If it doesn't crash, how long does it take to run the test?
-
My solution wouldn't work, since it needs almost 0.69 GB to save the positions.
You may save the positions every thousand lines as raven-worx suggested, or put the data into SQL. Before you optimize your code, maybe you should try the suggestion of JKSH.
-
-
I ran the test. The amount of RAM the program took up was around 90 MB, and the test needed about an hour to run. The computer is a little slow (1 processor at 1.6 GHz), so I am aware that any optimization will be difficult.
-
I just realized there was an error in my code: testVector should be QVector< QList<QByteArray> >, not QVector<QStringList>. I've fixed it in my earlier post.
[quote author="salvador" date="1369391422"]I ran the test. The amount of RAM the program took up was around 90 MB, and the test needed about an hour to run. The computer is a little slow (1 processor at 1.6 GHz), so I am aware that any optimization will be difficult.[/quote]
Your computer can't handle it, I'm afraid. I just made a 390 MB file: 5795857 lines, where each line says "This is a string" 4 times, separated by commas. It took me 6 seconds to run the test, compiled with Qt 5.1 beta 1 + MSVC2012 and run with my 3.4 GHz CPU + 8 GB RAM.

Also, the test code is already very fast because it doesn't convert the containers. When you convert the raw bytes into a string and convert the list into a vector, the program will slow down even more. You won't get anything faster than this test (unless you write lots of clever code, maybe).
You could try to simulate the server a different way, like what stereomatching and raven-worx suggested. Both options involve a lot more code, though, so you might be better off just waiting for your real server to become available.