Parse CSV with comma inside fields



  • Hi all,
    I have a CSV file that can contain lines such as the following:
    @
    "123","ciao"
    34,"ciao mondo"
    "12345","ciao, mondo"
    @

    It's easy to parse the first two lines.

    The problem is with the third one.
    If I split the string on the comma character, I run into trouble with "ciao, mondo".

    Do you have any suggestions?



  • Real world CSV files are very hard to parse, given all the exceptions and quirks. Maybe the "QT4 CSV file reader":http://sourceforge.net/projects/qcsv/ project can help you here.

    PS: a tag search on "CSV":/search/tag/csv could bring you some other hints too, I didn't check.



  • [quote author="Volker" date="1338284853"]Real world CSV files are very hard to parse, given all the exceptions and quirks. Maybe the "QT4 CSV file reader":http://sourceforge.net/projects/qcsv/ project can help you here.

    PS: a tag search on "CSV":/search/tag/csv could bring you some other hints too, I didn't check.[/quote]

    I already tried it, but it doesn't work as expected. I can't find any documentation, and I'm not very familiar with regular expressions.
    For example, if I try this:
    @
    QString str = "\"1234\",\"aswere\"";
    CSV csv(str);
    qDebug() << csv.parseLine();
    @
    I get an empty list...



  • I found this regular expression, which seems to work in some cases:
    @
    QString str = "\"1234\",\"asw,ere\"";
    // Group 1 captures either a quoted field or a single non-comma character.
    QRegExp rx("(?:^|,)(\"(?:[^\"]+|\"\")\"|[^,])");

    int pos = 0;
    while ((pos = rx.indexIn(str, pos)) != -1) {
        pos += rx.matchedLength();
        qDebug() << rx.cap(1); // the field is always in capture group 1
    }
    @
    This way I get:
    "1234"
    "asw,ere"

    so it works.

    Now the problem is with numeric fields as in this example:
    @
    "1234","asw,ere",23.34
    @



  • Why not just write a parser yourself? I.e. a function that walks through the line character by character and keeps track of the state.

    It has a bool variable called "inQuote", and whenever it encounters a quote character, it flips the inQuote value. If it encounters a comma, it treats it as a field separator only if inQuote is false. That's it. It should be no more than 15 lines or so, and it will be much faster and more flexible than applying a regex.
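
The approach described above could be sketched roughly as follows. This is only a minimal illustration in plain standard C++ (so it compiles without Qt); the function name `splitCsvLine` is made up for this example, and it assumes one record per line, no escaped quotes inside fields, and no trailing newline in the input. Swapping `std::string` for `QString` should be mechanical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line into fields.
// A comma separates fields only while we are outside a quoted section;
// the quote characters themselves are dropped from the result.
// Simplifying assumptions: no escaped quotes ("") and no embedded newlines.
std::vector<std::string> splitCsvLine(const std::string &line)
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuote = false;

    for (char c : line) {
        if (c == '"') {
            inQuote = !inQuote;          // entering or leaving a quoted section
        } else if (c == ',' && !inQuote) {
            fields.push_back(current);   // real separator: close the current field
            current.clear();
        } else {
            current += c;                // ordinary character, or a comma inside quotes
        }
    }
    fields.push_back(current);           // the last field has no trailing comma
    return fields;
}
```

Called on the line `"12345","ciao, mondo"`, this yields the two fields `12345` and `ciao, mondo`; on `"1234","asw,ere",23.34` it yields three fields, the last one being `23.34`.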



  • DerManu is right. At first glance, I doubt that a regex can catch every possible corner case of parsing CSV, quite apart from being cumbersome to write and maintain.



  • I know I could parse my string "by hand", but since CSV is a kind of "standard", I hoped someone had already solved my problem with regular expressions... :-)



  • Since there doesn't really appear to be a standard delimiter in use, you could try the lowest-common-denominator approach: keep a count of the number of items resulting from a comma split; if you encounter a new line with fewer splits, iterate over the data container again so that those entries have the same number of items, and then continue parsing.



  • hi, Luca try this
    I read csv in QTableWidget
    @
    QFile file("file_csv.csv");
    QStringList listA;
    int row = 0;
    if (file.open(QIODevice::ReadOnly)) {
        while (!file.atEnd()) {
            QString line = file.readLine();
            listA = line.split(",");
            ui->listWidget->addItems(listA);
            ui->spinBox_col->setValue(listA.size());
            ui->tableWidget->setColumnCount(listA.size());
            ui->tableWidget->insertRow(row);
            for (int x = 0; x < listA.size(); x++) {
                QTableWidgetItem *test = new QTableWidgetItem(listA.at(x));
                ui->tableWidget->setItem(row, x, test);
            }
            row++;
        }
    }
    file.close();
    @



  • Luca: No, regular expressions are the wrong tool here; that's why nobody has done it (or been successful with it). CSV files with quoting/escaping have a Chomsky type 2 grammar, but regular expressions can only work on languages with Chomsky type 3 grammars. Hence it cannot work. (If you're not familiar with the terminology, it means that any conceivable regular expression will still be too dumb to parse CSV.) And even if you dumbed down your CSV quoting rules (e.g. quotes only allowed at field boundaries), your regular expression would become incredibly ugly and thus unreadable for others (or yourself in six months). Do yourself a favor and write a small parser :).

    Skyrim: The code you provided doesn't work. It will break on @"12345","ciao, mondo"@, for example.
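
To illustrate why a plain split on commas breaks on that line, here is a small stand-in in standard C++. `naiveSplit` is a hypothetical helper written for this example and is not the Qt call itself, but for this input it cuts at the same places as `line.split(",")`.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on every comma and ignores quotes entirely,
// which is the behaviour that goes wrong for quoted fields.
std::vector<std::string> naiveSplit(const std::string &line)
{
    std::vector<std::string> parts;
    std::stringstream stream(line);
    std::string part;
    while (std::getline(stream, part, ','))   // cut at each comma, quoted or not
        parts.push_back(part);
    return parts;
}
```

For `"12345","ciao, mondo"` this returns three pieces instead of two fields: `"12345"`, `"ciao` and ` mondo"`, with the quote characters still attached.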



  • [quote author="Skyrim" date="1338737081"]hi, Luca try this
    I read csv in QTableWidget
    [/quote]

    Thanks, Skyrim, but as DerManu wrote, it will break on
    @
    "12345","ciao, mondo"
    @

    DerManu, I didn't know anything about "Chomsky grammars". Thanks for explaining this to me.

    So, as you said, the only solution is to parse my CSV by hand?
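
For reference, a hand-written parser can stay small even when it also handles a doubled quote (`""`) inside a quoted field, which is the case the regex earlier in the thread tries to cover. The sketch below is in plain standard C++ and makes several assumptions: the name `parseCsvLine` is hypothetical, each record fits on one line, and the input carries no trailing newline.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line into fields, treating "" inside a
// quoted field as one literal quote character.
// Assumption: one record per line, no line breaks inside fields.
std::vector<std::string> parseCsvLine(const std::string &line)
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuote = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuote) {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                current += '"';          // "" inside quotes is one literal quote
                ++i;                     // skip the second quote of the pair
            } else if (c == '"') {
                inQuote = false;         // closing quote of the field
            } else {
                current += c;            // includes commas inside quotes
            }
        } else if (c == '"') {
            inQuote = true;              // opening quote of a field
        } else if (c == ',') {
            fields.push_back(current);   // separator outside quotes
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);           // last field
    return fields;
}
```

With this sketch, the line `"say ""hi""",42` gives the fields `say "hi"` and `42`, and `"12345","ciao, mondo"` gives `12345` and `ciao, mondo`.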

