QRegExp to parse a CSV file



  • Hello,

    I'm trying to use a regular expression to parse a simple CSV file which has this form:

    01;3.6.1;A;C;HELLO;1: quit;UINT8;N.A.;0.7;4.5;"Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua."
    03;5.4.2;F;K;GOODBYE;0: stay;UINT8;N.A.;0.0;1.2;Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
    

    I've found this reg exp:

    (\;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^"\;\n]*))
    

    I have tested it on regexr.com and it does the job. In my code I use it like this:

    const QRegExp regExp("(\\;|\\n|^)(?:""([^\"]*(?:\"\"[^\"]*)*)\"|([^\"\\;\\n]*))");
    
    if (!regExp.isValid())
      qDebug() << "Regular expression error " << regExp.errorString();
    
    QString line = csvFile.readLine();
    QStringList fields = line.split(regExp);
    

    But when I run it in my code, I only get back a list of empty strings (and the wrong number of them).

    Can anybody tell me why?


  • Lifetime Qt Champion

    Hi @Merlino,

    for a start, try this:

    #include <QDebug>
    #include <QRegularExpression>
    
    int main(int argc, char *argv[])
    {
        const QString s = R"(
    01;3.6.1;A;C;HELLO;1: quit;UINT8;N.A.;0.7;4.5;"Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
    03;5.4.2;F;K;GOODBYE;0: stay;UINT8;N.A.;0.0;1.2;Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
    )";
        const QRegularExpression regExp(R"x((\;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^"\;\n]*)))x");
    
        QRegularExpressionMatchIterator matchIt = regExp.globalMatch(s);
        while (matchIt.hasNext()) {
            const QRegularExpressionMatch match = matchIt.next();
            qDebug() << match.capturedTexts();
        }
    
        return 0;
    }
    

    You will need to fine-tune it, but it is a step in the right direction.

    Output:

    ("\n01", "\n", "", "01")
    (";3.6.1", ";", "", "3.6.1")
    (";A", ";", "", "A")
    (";C", ";", "", "C")
    (";HELLO", ";", "", "HELLO")
    (";1: quit", ";", "", "1: quit")
    (";UINT8", ";", "", "UINT8")
    (";N.A.", ";", "", "N.A.")
    (";0.7", ";", "", "0.7")
    (";4.5", ";", "", "4.5")
    (";", ";", "", "")
    ("\n03", "\n", "", "03")
    (";5.4.2", ";", "", "5.4.2")
    (";F", ";", "", "F")
    (";K", ";", "", "K")
    (";GOODBYE", ";", "", "GOODBYE")
    (";0: stay", ";", "", "0: stay")
    (";UINT8", ";", "", "UINT8")
    (";N.A.", ";", "", "N.A.")
    (";0.0", ";", "", "0.0")
    (";1.2", ";", "", "1.2")
    (";Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.", ";", "", "Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.")
    ("\n", "\n", "", "")
    

    Regards



  • @Merlino Hi, this is probably because QRegExp is not fully compatible with Perl regular expressions. I guess it should work with QRegularExpression. Another possibility is that you have a different configuration (multiline, global, case sensitivity).

    edit: see note from https://doc.qt.io/qt-5/qregexp.html#details

    Note: In Qt 5, the new QRegularExpression class provides a Perl compatible implementation of regular expressions and is recommended in place of QRegExp.



  • @Merlino Can't you simply use line.split(";") ?



  • @Gojir4 No, because the string fields can contain punctuation and quotation marks, so a simple split would be fooled.


  • Lifetime Qt Champion

    Hi @Merlino,

    use QRegularExpression, please. QRegExp has been deprecated since 2012 and will be removed from Qt6.

    Regards



  • @Merlino I see, so do a global match and iterate over the results to fill your QStringList. Your regex is already doing the "splitting" job.
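
A minimal sketch of the "global match and iterate" idea suggested above. QString::split() cuts the string at each match, and this pattern matches the fields themselves, so splitting on it leaves only the pieces in between, which would explain the list of empty strings. Collecting the matches is what yields the fields. To stay compilable without Qt, the sketch uses std::regex in place of QRegularExpression (an assumption of this example, not what the thread uses), and `fieldsFromLine` is a hypothetical helper name. It also assumes the line has already been stripped of its trailing newline.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the fields of one CSV line by iterating over
// all matches of the pattern instead of splitting on it.
std::vector<std::string> fieldsFromLine(const std::string &line)
{
    // Same pattern as in the thread:
    //   group 1 = separator, newline, or start of input
    //   group 2 = content of a quoted field
    //   group 3 = content of an unquoted field
    static const std::regex fieldRe(R"x((;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^";\n]*)))x");
    static const std::regex doubledQuote("\"\"");

    std::vector<std::string> fields;
    for (std::sregex_iterator it(line.begin(), line.end(), fieldRe), end; it != end; ++it) {
        const std::smatch &m = *it;
        assert(m.size() == 4); // whole match plus three capture groups
        if (m[2].matched) {
            // Quoted field: collapse each doubled quote ("") into one quote.
            fields.push_back(std::regex_replace(m[2].str(), doubledQuote, "\""));
        } else {
            // Unquoted (possibly empty) field.
            fields.push_back(m[3].str());
        }
    }
    return fields;
}
```

With Qt, the same loop should map onto QRegularExpression::globalMatch() and QRegularExpressionMatch::captured(2) / captured(3). One limitation of the pattern itself: a line that starts with `;` loses its empty first field, because the `;` alternative in group 1 is tried before `^`.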



  • @aha_1980 I have changed my code to use QRegularExpression, but the problem is still present.


  • I haven't tested what the regular expressions do, but you might want to augment your test case to include a string value which itself has embedded " or ; characters --- if you intend to support those.
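
As a sketch of what such an augmented test case could look like, the following hand-written splitter treats `;` inside double quotes as literal and reads `""` as one quote character. It is an alternative to the regular expression, not the code from the thread; `splitCsvLine` is a hypothetical name, and the quoting rules (RFC 4180 style, with `;` as separator) are an assumption about the file format. The line is assumed to come without its trailing newline.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line on `sep`, honouring double-quoted
// fields in which `sep` is literal and "" stands for one quote character.
std::vector<std::string> splitCsvLine(const std::string &line, char sep = ';')
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"') {
                current += '"'; // escaped quote inside a quoted field
                ++i;            // skip the second quote of the pair
            } else if (c == '"') {
                inQuotes = false; // closing quote
            } else {
                current += c; // separators are literal inside quotes
            }
        } else if (c == '"') {
            inQuotes = true; // opening quote
        } else if (c == sep) {
            fields.push_back(current); // field boundary
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current); // last field, possibly empty
    assert(!fields.empty());   // even an empty line yields one empty field
    return fields;
}
```

Useful test lines are ones like `0.7;4.5;"a;b"` (separator inside quotes), `x;"say ""hi""";y` (escaped quotes), and `;a;` (empty first and last fields).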



  • @Merlino said in QRegExp to parse a CSV file:

    simple CSV file

    If the file conforms to such a format, you should have one and only one marker as field separator. Originally it was a comma (hence the name), but later other characters came into use (which obviously cannot be part of the field values...)

    @Gojir4
    Can't you simply use line.split(";") ?

    Yes you can. @Gojir4 provided the right answer, I guess. From your data example, it does seem to be an SCSV file in your case: semicolon-separated values.

    @Merlino about 2 hours ago
    no because the string fields can contain punctuation and quotation marks so the simple split would be fooled.

    Yes, you'll have punctuation and quotation marks in the string fields, but I bet none of those characters will be a semicolon (;)

    It looks like you're over-complicating your use case.
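
To make the trade-off concrete, here is a sketch of the plain split in standard C++ (the counterpart of line.split(";"), with `naiveSplit` as a hypothetical name). It is correct as long as no field contains the separator, and it shows what happens when a quoted field does contain one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits on every occurrence of `sep`, with no
// knowledge of quoting. Surrounding quotes stay part of the field text.
std::vector<std::string> naiveSplit(const std::string &line, char sep = ';')
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos) {
            parts.push_back(line.substr(start)); // remainder after last separator
            break;
        }
        parts.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the separator
    }
    assert(!parts.empty()); // even an empty line yields one empty part
    return parts;
}
```

For `A;C;HELLO` this gives three parts, as expected. For `1.2;"stay; or go"` it also gives three parts, `1.2`, `"stay` and ` or go"`, so the quoted field is cut in two. Whether that matters depends on whether the data can ever contain a semicolon inside a field.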

