Extract Emails from QString using Regex



  • I am trying to extract email addresses from text held in a QString. The text contains many words, some of which are email addresses, so I decided to use a regex. This is the code I am using:

    #include <iostream>
    #include <string>
    #include <QDebug>
    #include <QRegExp>
    #include <QStringList>

    using namespace std;

    int main() {
      string s;
      while (getline(cin, s)) {
        QRegExp rx("^[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}$"); // create the regular expression
        rx.indexIn(QString::fromStdString(s));
        QStringList l = rx.capturedTexts();
        cout << "size = " << l.size() << endl;
        for (QString i : l) {
          if (!i.isEmpty()) {
            qDebug() << i;
          }
        }

        // check whether it works on a single word using QRegularExpression
        /*QRegularExpression reg("^[0-9a-zA-Z]+([0-9a-zA-Z]*[-._+])*[0-9a-zA-Z]+@[0-9a-zA-Z]+([-.][0-9a-zA-Z]+)*([0-9a-zA-Z]*[.])[a-zA-Z]{2,6}$");
        if (reg.match(QString::fromStdString(s)).hasMatch()) {
          puts("YES valid");
        }
        else {
          puts("Not Email");
        }*/
      }
    }
    

    However, it has some problems. First, I used the commented-out code to check that the regex matches an email address on its own, for example:

    dsads@gmail.com
    habobo887@gmail.com
    

    That works without problems. The problem appears when I enter a full string and use the uncommented code to extract the emails from it, like this:

    hello iam here this a full article that may contains emails like this hopeitworks@gmail.com ...etc
    

    It does not work and outputs nothing: the size of the QStringList is 4, but every string in it is empty.
    If I enter only the email, it outputs:

    size = 4
    "hopeitworks@gmail.com"
    "mail." // !! this output is wrong
    

    What should I do to fix this so that the QStringList contains only the emails found in the text? I don't know whether the problem is in the regex or somewhere else.
    Thanks in advance.



  • First, I suspect your regex is not what you really want to capture, and second, I think you are misunderstanding what QRegExp::capturedTexts() returns (though I could be wrong on one or both counts, of course!).

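    First, "^[0-9a-zA-Z]+" matches only the first word of the string, whatever that is, because the caret anchors the match to the start of the string. I'm not sure why you would want that for general text. The "$" at the end does the same for the end of the string, so with both anchors in place the pattern can only match when the whole string is a single email address. That would explain why an address on its own matches but the same address inside a sentence does not. I think you probably want to drop that entire leading term, and the trailing "$" as well.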
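
    To see the effect of the anchors in isolation, here is a minimal sketch. It uses std::regex from the C++ standard library rather than QRegExp so that it compiles without Qt. The helper name foundIn and the simplified patterns mentioned below are made up for illustration and are not from your code:

    ```cpp
    #include <regex>
    #include <string>

    // Hypothetical helper (not part of the original code): returns true if
    // `pattern` matches somewhere inside `text`.
    bool foundIn(const std::string& text, const std::string& pattern) {
        // regex_search looks for a match at any position, but a pattern that
        // starts with ^ and ends with $ can still only match the whole string.
        return std::regex_search(text, std::regex(pattern));
    }
    ```

    With the anchored pattern ^\w+@\w+\.\w+$ this returns true for "a@b.com" but false for "mail me at a@b.com please". Dropping the ^ and $ makes the second call return true as well.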

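    After that I think you are probably just overcomplicating matters. A regex that literally matches 100% of valid email addresses is indeed complex (you can find one via Google; you are looking for a regex that matches RFC 5322 Section 3.4.1), but one that will get you almost all practical email addresses is "[0-9a-zA-Z._+-]+@[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)+" (remember to double each backslash inside a C++ string literal). Note that the "-" has to sit at the end of a character class to be taken literally, and the "." has to be escaped, otherwise it matches any character. As for capturedTexts(), it does not list every email in the text: it describes a single match, with the whole match as the first element followed by one element per parenthesized sub-expression. That is why your list has 4 entries (the full match plus your three groups) and why "mail." shows up. To collect all the addresses, call indexIn() repeatedly, each time passing an offset just past the previous match (its position plus matchedLength()), and take cap(0).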
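
    As a rough sketch of collecting every address from a longer text, here is the same idea written with std::regex from the C++ standard library, so it can be compiled without Qt. The function name extractEmails and the exact pattern are assumptions for illustration. In Qt the equivalent would be looping over QRegExp::indexIn() with an offset, or using QRegularExpression::globalMatch():

    ```cpp
    #include <regex>
    #include <string>
    #include <vector>

    // Hypothetical helper (not from the thread): collect every substring of
    // `text` that looks like an email address.
    std::vector<std::string> extractEmails(const std::string& text) {
        // No ^ or $ anchors, so a match may start anywhere in the text.
        // The dot is escaped and '-' sits last in each class so it is literal.
        static const std::regex emailRx(
            R"([0-9a-zA-Z._+-]+@[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)+)");

        std::vector<std::string> found;
        // The iterator advances from one non-overlapping match to the next.
        for (std::sregex_iterator it(text.begin(), text.end(), emailRx), end;
             it != end; ++it) {
            found.push_back(it->str());  // str() with no index is the whole match
        }
        return found;
    }
    ```

    Called on your sample sentence, this should return a single entry, "hopeitworks@gmail.com". It is deliberately permissive and makes no attempt to validate addresses against RFC 5322.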

