Extract Emails from QString using Regex
-
I am trying to extract email addresses from a QString that contains many words, some of which are email addresses. I thought I would use a regex for that, so I wrote this code:
    #include <QDebug>
    #include <QRegExp>
    #include <QRegularExpression>
    #include <QString>
    #include <QStringList>
    #include <iostream>
    #include <string>
    using namespace std;

    int main() {
        string s;
        while (getline(cin, s)) {
            // create the regular expression
            QRegExp rx("^[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}$");
            rx.indexIn(QString::fromStdString(s));
            QStringList l = rx.capturedTexts();
            cout << "size = " << l.size() << endl;
            for (QString i : l) {
                if (!i.isEmpty()) {
                    qDebug() << i;
                }
            }
            // check whether it works on a single word using QRegularExpression
            /*QRegularExpression reg("^[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}$");
            if (reg.match(QString::fromStdString(s)).hasMatch()) {
                puts("YES valid");
            } else {
                puts("Not Email");
            }*/
        }
    }
But it has some problems. First, I used the commented-out code to check that the regex works on an email address by itself, such as these:
dsads@gmail.com habobo887@gmail.com
That works with no problem. The problem appears when I pass in a full string and use the uncommented code to extract the emails from it, like this:
hello iam here this a full article that may contains emails like this hopeitworks@gmail.com ...etc
It does not work: it outputs nothing. The size of the QStringList is 4, but all the strings in it are empty.
If I enter only the email, it outputs:
    size = 4
    "hopeitworks@gmail.com"
    "mail." // !! this is the wrong output
So what should I do to fix this and get only the emails from the text in the QStringList? I don't know where the problem is, whether it is in the regex I use or somewhere else.
Thanks in advance.
-
First, I suspect your regex is not what you really want to capture, and second, I think you are misunderstanding what QRegExp::capturedTexts() returns (though I could be wrong on one or both counts, of course!).
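To make the second point a bit more concrete, here is a minimal sketch of what a single match result holds. It uses std::regex rather than QRegExp (my substitution, so that it builds without Qt), and captured_texts and kQuestionPattern are made-up names. As far as the layout goes, QRegExp::capturedTexts() works the same way: the first entry is the whole match and each following entry belongs to one parenthesised group, which is where size = 4 comes from for a pattern with three groups.

```cpp
#include <regex>
#include <string>
#include <vector>

// The pattern from the question: three parenthesised groups, anchored
// with ^ and $.
static const char* const kQuestionPattern =
    "^[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+"
    "([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}$";

// Hypothetical helper: returns the whole match followed by the text of
// each capture group, or an empty vector when nothing matches.
// (QRegExp instead reports a list of empty strings on failure, as the
// question observed.)
std::vector<std::string> captured_texts(const std::string& input) {
    static const std::regex rx(kQuestionPattern);
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_search(input, m, rx)) {
        // Entry 0 is the whole match; entries 1..3 are the groups.
        for (const auto& sub : m) {
            out.push_back(sub.str());
        }
    }
    return out;
}
```

Called with "hopeitworks@gmail.com" this gives four entries, the first being the whole address. It is one match with its groups, not a list of several addresses.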
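First, "^[0-9a-zA-Z]+" matches the first word of the line, whatever that is. I'm not sure why you would want to do that in a general text string. In particular, I don't really see why you would want the caret, which anchors the pattern to the start of the string, so it can only match beginning at the first word. The "$" at the end anchors it to the end of the string in the same way, which is why the pattern accepts an address by itself but not a sentence that contains one. I think you probably want to drop that entire term, and the trailing "$" with it.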
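As a rough illustration of what the anchors do, the sketch below runs the question's pattern with and without ^ and $. It works on the std::string that getline already produces and uses std::regex in place of QRegExp (again my substitution, so that it builds without Qt); kCore, first_match and the two wrappers are names I made up for the example.

```cpp
#include <regex>
#include <string>

// Core of the question's pattern, without the ^ and $ anchors.
static const std::string kCore =
    "[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+"
    "([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}";

// Returns the first match of `pattern` in `text`, or "" if there is none.
std::string first_match(const std::string& pattern, const std::string& text) {
    const std::regex rx(pattern);
    std::smatch m;
    return std::regex_search(text, m, rx) ? m.str() : std::string();
}

// Anchored: the whole input has to be a single address.
std::string first_match_anchored(const std::string& text) {
    return first_match("^" + kCore + "$", text);
}

// Unanchored: an address may appear anywhere in the input.
std::string first_match_unanchored(const std::string& text) {
    return first_match(kCore, text);
}
```

With the sentence from the question, the anchored version returns an empty string and the unanchored one returns hopeitworks@gmail.com. With the address alone, both return it.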
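After that, I think you are probably just overcomplicating matters. A regex that literally matches 100% of valid emails is indeed complex (you can find one via Google; you are looking for a regex that matches RFC 5322 Section 3.4.1), but one that will get you almost all practical email addresses is "[0-9a-zA-Z+_.-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z.-]+". The hyphen goes last in each character class so that it is taken literally rather than as a range, and the dot before the domain suffix is escaped (in a C++ string literal the backslash has to be doubled). Since there are no sub-expressions to deal with, capturedTexts() then contains just the matched address. Note that it describes a single match, so to collect every address in the text you need to call indexIn() repeatedly, starting each search where the previous match ended.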
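Here is a minimal sketch of collecting every address with a simplified, unanchored pattern. It is written against std::regex and std::string rather than QRegExp and QString (my substitution, so that it builds without Qt), and extract_emails is a hypothetical helper name. In Qt the same loop would be built from QRegExp::indexIn(text, pos) together with matchedLength(), or from QRegularExpression::globalMatch(). The pattern is only a practical approximation, not an RFC 5322 validator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every substring of `text` that looks
// like an email address.
std::vector<std::string> extract_emails(const std::string& text) {
    // No ^ or $ anchors and no capture groups. The '-' sits last in each
    // character class so it is read literally, and the dot is escaped.
    static const std::regex rx(
        R"([0-9a-zA-Z+_.-]+@[0-9a-zA-Z-]+\.[0-9a-zA-Z.-]+)");

    std::vector<std::string> emails;
    // sregex_iterator resumes the search after the end of each match.
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end;
         it != end; ++it) {
        emails.push_back(it->str());
    }
    return emails;
}
```

For the sentence in the question this returns a single entry, hopeitworks@gmail.com. One known limitation of the simple pattern: an address directly followed by a full stop will have that full stop included in the match.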