Large text file import into sqlite



  • So I have been working on a program which imports the contents of a text file into a sqlite database, divided by line. This is my first foray into the wonderful world of QT5, but not my first rodeo with C++. I have spent a goodly amount of time Googling this all, and looking through the documentation. I am not a stranger to working through my own problems and figuring out what is wrong.

    In fact, this program is working. I am not asking for help fixing an error, Google does a dang good job of answering those questions for me. I am asking for your experienced advice, simply because I am still pretty new at this.

    Back to my 'issue'. Like I said, i am using sqlite to store a lot of small amounts of text. <30GB sizes of text. (worst case scenario, but still). The program has one functioning button, which allows a user to import a text file into the database. I have got that to work. My problem is that it can import (on my pretty good computer) about 4 lines of text per second. If i have a a 1.2GB file with >200,000,000 lines of text, this becomes an issue.

    Here is my relevant code;
    @void MainWindow::on_pushButton_clicked() {
    //open personal sqlite database
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName("./data.sqlt");

    //make sure the database opened properly
    if(!db.open()) { //just in case
        qFatal("Failed to connect to the database");
    } else {
        qDebug("Succesfully Connected to the database");
    }
    QSqlQuery qry;
    
    //create database with table if it doesn't exist
    qry.prepare("CREATE TABLE IF NOT EXISTS pw (pass VARCHAR(63), value INT(256))");
    if(!qry.exec&#40;&#41;) {
        qDebug() << qry.lastError();
    } else {
        qDebug() << "Succesfully Connected to the table";
    }
    
    //open dialgue for finding new wordlist to import
    QString database = QFileDialog::getOpenFileName(this, tr("Open Database"),"",tr("Password Database (*.*)"));
    QFile fdatabase(database);
    
    if (!fdatabase.open(QIODevice::ReadOnly | QIODevice::Text)) { //if the file doesn't exist
        qDebug("User cancelled alphabetical wordlist import");
    } else {
        //show the dialogue for choosing the wordlist to import
        QProgressDialog progress("Gathering information...", "Abort Import", 0, 0, this);
            progress.show();
            progress.setWindowModality(Qt::WindowModal);
    
        //count the words in the wordlist
        int numo = 0;
        int numo2 = 100;
        while(!fdatabase.atEnd()) {
            //read the text file
            fdatabase.readLine();
    
            //count the lines of text
            numo++;
    
            //keep the user informed
            if(numo2 == numo) {
                progress.setLabelText("Gathering information...\nFound " + QString::number(numo) + " new passwords");
                numo2 += 100;
            }
    
            //ensure the program doesn't freeze
            qApp->processEvents(QEventLoop::ExcludeUserInputEvents);
    
            //the user can cancel importing wordlist if he wants
            if (progress.wasCanceled()) {
                break; }
        }
    
        //restart the database because i don't know how to read the file again
        fdatabase.close();
        fdatabase.open(QIODevice::ReadOnly | QIODevice::Text);
    
        //Update the progress dialogue and set everything ready for actual import
        QString tnumPass;
        QString xnumPass;
        tnumPass = QString::number(numo);
        progress.setLabelText("Importing 1 of " + tnumPass + " new passwords...");
        progress.setMaximum(numo);
        qDebug() << "Counted" << numo << "passwords in selected file" <<  fdatabase.fileName();
        numo = 0;
    
        //read the wordlist into personal sqlite database
        while(!fdatabase.atEnd()) {
            QString line = fdatabase.readLine();
    
            //show progress
            numo++;
            xnumPass = QString::number(numo);
            progress.setValue(numo);
            progress.setLabelText("Importing " + xnumPass + " of " + tnumPass + " new passwords...\nImporting Password: " + line);
    
            //insert new password
            qry.prepare("INSERT INTO pw (pass, value) values ('" + line + "', '1')");
    
            //if the password was not inserted correctly
            if (!qry.exec&#40;&#41;) {
                qDebug() << qry.lastError();
            }
    
            //ensure the program doesn't freeze
            qApp->processEvents(QEventLoop::ExcludeUserInputEvents);
    
            //the user can cancel importing wordlist if he wants
            if (progress.wasCanceled()) {
                break;
            }
        }
    //close the sqlite database
    db.close();
    }
    

    }@

    My question, or plea for advice, then, is this. How do I speed up this process? If I import a 10GB text file, going line by line, at about 4 lines a second into my sqlite database, it will take for ever for it to finish. I can't load it all into memory and then push it to the sqlite database, because I can't guarantee that the end user will have the amount of RAM needed to fit it all.

    Is there a faster way to store this type of data that i am unaware of? Am I doing something, really really, wrong that is causing this to put the data into the sqlite file?

    If you want to compile this yourself, here is the source code (run in debian based linux, on a 3.2ghz dual core AMD processor with a 1TB 7200RPM hard drive) http://www.sendspace.com/file/7t89ta.

    I would love to hear your opinion about what is the best way to work with VERY large text files, and the best way to store them in a easy access format. Any and all contributions would be greatly appreciated.



  • Loading and saving multiple lines will help fast up the process. The question is what would be optimum size of the lines which should be loaded, now this cannot is answered completely, as the file on disk may be fragmented, and reading multiple line may not always help in reading, but loading multiple line in memory will defenetly help in writing to database faster.

    I would say do some trails on different computers and see what numer of lines/memory looks fairly good. As as already said, loading multiple line into memory will help, but again it depends on how fast you could read and write back to disk, as end of the operation each record(line) has to be saved to database file which again is on the same disk. So for large file (if using HDD) you disk head has to switch positions quite often. Also look if you could configure sqlite page/commit size, and I belive you could also manage using a manual commit instead of auto commit so that you have control when data is written to disk.



  • Awesome, I will look into multi-threading with qt. It seems there are some good tutorials / examples already out there. for others googling this, here is one such tutorial. http://www.johanpaul.com/blog/2011/09/easy-threading-with-qtconcurrentrun/

    However, I do not know how I would set up write caching for sql within the qt framework. Do you have any links or search terms that I could use to get that started?



  • Wrap your insert code inside a transaction and you will see huge speed increase. You should probably put
    @
    db.transaction()
    @

    before the while statement and

    @
    db.commit()
    @

    after the loop.

    I would also suggest to reorganize a bit your code into more functions.


Log in to reply
 

Looks like your connection to Qt Forum was lost, please wait while we try to reconnect.