QRegExp to parse a CSV file
-
Hello,
I'm trying to use a regular expression to parse a simple CSV file which has this form:
01;3.6.1;A;C;HELLO;1: quit;UINT8;N.A.;0.7;4.5;"Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua."
03;5.4.2;F;K;GOODBYE;0: stay;UINT8;N.A.;0.0;1.2;Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
I've found this regular expression:
(\;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^"\;\n]*))
I have tested it at regexr.com and it does the job.
const QRegExp regExp("(\\;|\\n|^)(?:""([^\"]*(?:\"\"[^\"]*)*)\"|([^\"\\;\\n]*))");
if (!regExp.isValid())
    qDebug() << "Regular expression error " << regExp.errorString();
QString line = csvFile.readLine();
QStringList fields = line.split(regExp);
But when I run it in my code, only a list of empty strings (and the wrong number of them) is returned.
Can anybody tell me why?
-
Hi @Merlino,
for a start, try this:
#include <QDebug>
#include <QRegularExpression>

int main(int argc, char *argv[])
{
    const QString s = R"(
01;3.6.1;A;C;HELLO;1: quit;UINT8;N.A.;0.7;4.5;"Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
03;5.4.2;F;K;GOODBYE;0: stay;UINT8;N.A.;0.0;1.2;Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
)";
    const QRegularExpression regExp(R"x((\;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^"\;\n]*)))x");
    QRegularExpressionMatchIterator matchIt = regExp.globalMatch(s);
    while (matchIt.hasNext()) {
        const QRegularExpressionMatch match = matchIt.next();
        qDebug() << match.capturedTexts();
    }
    return 0;
}
You will need to fine-tune it, but it goes in the correct direction.
Output:
("\n01", "\n", "", "01")
(";3.6.1", ";", "", "3.6.1")
(";A", ";", "", "A")
(";C", ";", "", "C")
(";HELLO", ";", "", "HELLO")
(";1: quit", ";", "", "1: quit")
(";UINT8", ";", "", "UINT8")
(";N.A.", ";", "", "N.A.")
(";0.7", ";", "", "0.7")
(";4.5", ";", "", "4.5")
(";", ";", "", "")
("\n03", "\n", "", "03")
(";5.4.2", ";", "", "5.4.2")
(";F", ";", "", "F")
(";K", ";", "", "K")
(";GOODBYE", ";", "", "GOODBYE")
(";0: stay", ";", "", "0: stay")
(";UINT8", ";", "", "UINT8")
(";N.A.", ";", "", "N.A.")
(";0.0", ";", "", "0.0")
(";1.2", ";", "", "1.2")
(";Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.", ";", "", "Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.")
("\n", "\n", "", "")
Regards
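As one possible way to do the fine-tuning mentioned above: in each match, capture group 2 holds the content of a quoted field and group 3 holds an unquoted field, so a small helper can pick whichever one matched and collapse doubled quotes. The sketch below uses std::regex from the standard library as a stand-in for QRegularExpression so that it compiles without Qt. The names unescapeQuotes and extractFields are hypothetical, and the code assumes a single record without a trailing newline.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: turn each doubled quote ("") inside a quoted
// field back into a single quote character.
std::string unescapeQuotes(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        out += s[i];
        // Skip the second quote of a "" pair.
        if (s[i] == '"' && i + 1 < s.size() && s[i + 1] == '"')
            ++i;
    }
    return out;
}

// Hypothetical helper: extract the fields of ONE record (assumed to
// contain no trailing newline) with the pattern from this thread.
std::vector<std::string> extractFields(const std::string& record)
{
    static const std::regex re(
        R"re((;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^";\n]*)))re");

    std::vector<std::string> fields;
    for (std::sregex_iterator it(record.begin(), record.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m[2].matched)
            fields.push_back(unescapeQuotes(m[2].str())); // quoted field
        else
            fields.push_back(m[3].str());                 // unquoted field
    }
    return fields;
}
```

With Qt, the same selection would presumably be made by checking the match's captured(2) and captured(3) inside the globalMatch() loop.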
-
@Merlino Hi, probably because QRegExp is not fully compatible with Perl regular expressions. I guess it should work with QRegularExpression. Another possibility is that the online tester uses a different configuration (multiline, global, case sensitivity).
edit: see note from https://doc.qt.io/qt-5/qregexp.html#details
Note: In Qt 5, the new QRegularExpression class provides a Perl compatible implementation of regular expressions and is recommended in place of QRegExp.
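A possible additional factor, independent of the regex flavor: QString::split() treats its pattern as the separator and returns the text between matches. The pattern in the question matches each delimiter together with the field that follows it, so the matches cover the whole line and only empty strings are left in between. The sketch below illustrates that effect with std::regex from the standard library as a stand-in for Qt, so it can be compiled on its own; splitOn and fieldPattern are hypothetical names, not Qt or standard-library identifiers.

```cpp
#include <regex>
#include <string>
#include <vector>

// The field-matching pattern from the question (delimiter + field).
const std::regex fieldPattern(
    R"re((;|\n|^)(?:"([^"]*(?:""[^"]*)*)"|([^";\n]*)))re");

// Hypothetical helper: use `pattern` as a separator and return the
// text found between its matches, which is what a regex split does.
std::vector<std::string> splitOn(const std::string& text,
                                 const std::regex& pattern)
{
    std::vector<std::string> parts;
    // Submatch index -1 selects the text between matches.
    std::sregex_token_iterator it(text.begin(), text.end(), pattern, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        parts.push_back(it->str());
    return parts;
}
```

Here splitOn("01;A;HELLO", fieldPattern) yields only empty strings, while splitOn("01;A;HELLO", std::regex(";")) yields 01, A and HELLO. Iterating over the matches instead of splitting, as the globalMatch() reply in this thread does, avoids the problem.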
-
@Merlino said in QRegExp to parse a CSV file:
simple CSV file
If the file conforms to that format, it should have one and only one character as the field separator. Originally this was a comma (hence the name), but later other characters came into use (ones that obviously cannot be part of the field values...).
@Gojir4 said:
Can't you simply use line.split(";")?
Yes, you can. @Gojir4 provided the right answer, I guess. From your data example, your file does seem to be an SCSV file: semicolon-separated values.
@Merlino said:
no because the string fields can contain punctuation and quotation marks so the simple split would be fooled.
Yes, you'll have punctuation and quotation marks in the string fields, but I bet none of those characters will be a semicolon (;).
It looks like you're over-complicating your use case.
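If a quoted field ever did contain a semicolon, a plain split would cut that field in two. A middle ground between split(";") and a regular expression is a small hand-written scanner that tracks whether it is inside quotes. The following is a minimal sketch in standard C++ (std::string instead of QString, to keep it self-contained); splitRecord is a hypothetical helper, and it assumes one record per call with "" used to escape a quote inside a quoted field.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one record on `sep`, ignoring separators
// that appear inside double-quoted fields. The enclosing quotes are
// dropped and "" inside a quoted field becomes a single quote.
std::vector<std::string> splitRecord(const std::string& line, char sep = ';')
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';   // escaped quote
                    ++i;
                } else {
                    inQuotes = false; // closing quote
                }
            } else {
                current += c;         // separators are literal here
            }
        } else if (c == '"') {
            inQuotes = true;          // opening quote
        } else if (c == sep) {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);        // last field of the record
    return fields;
}
```

For input with no quoted semicolons this returns the same fields as a plain split on ";", apart from stripping the enclosing quotes, so it only pays off when such fields can actually occur.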