Get all URLs in a text file



  • I have a text file containing a mix of plain text and URLs. How can I extract all the URLs from that file using Qt?


  • Qt Champions 2016

    Hi,
    Is it just a list of URLs, or are the URLs mixed with other kinds of text?
    Are you asking how to parse them or how to read the text file?
    Can you show some lines from the file?
    You can read the file line by line this way:

    QFile inputFile(fileName);
    if (inputFile.open(QIODevice::ReadOnly))
    {
       QTextStream in(&inputFile);
       while (!in.atEnd())
       {
          QString line = in.readLine();
          ...
       }
       inputFile.close();
    }
    


  • It's mixed text with URLs. I think the approach you pointed out might have a performance problem. Am I wrong?


  • Qt Champions 2016

    Well, it reads one line at a time, if that is what you mean.
    But it all depends on how your text file is structured.
    If the text is not neatly split into lines (\n), reading it line by line is pointless.


  • Lifetime Qt Champion

    Hi,

    You can also load the entire content of your file and then search through it using QRegularExpression.



  • @mrjj Actually, the application doesn't need to know whether the text is split into lines or not; I just need to get all the links.

    I think I will use this regex: https://gist.github.com/dperini/729294

    @SGaist I saw your answer before posting, but yes, I think that in this case it's better to use regex.


  • Qt Champions 2016

    @yodusow-bardon
    OK, so it's like a dump.
    That is one nice regular expression ;)



  • @mrjj I just realized that this one isn't working with Qt. I'm getting a warning:

    QRegularExpressionPrivate::doMatch(): called on an invalid QRegularExpression object

    I will try to find another one like this or make this one work. If you have one, I will accept that too. Haha.


  • Qt Champions 2016

    @yodusow-bardon
    Hi,
    The actual expression should still work with the QRegularExpression class, shouldn't it?
    The original just seems to join strings using + to make it more readable:
    "(?:(?:https?|ftp)://)" + "(?:\S+(?::\S*)?@)?" ...
    So you can easily convert it to Qt, I think.
    Or not?



  • @mrjj That is how I'm doing it:

    QRegularExpression re(
      "^"
      // protocol identifier
      "(?:(?:https?|ftp)://)"
      // user:pass authentication
      "(?:\\S+(?::\\S*)?@)?"
      "(?:"
      // IP address exclusion
      // private & local networks
      "(?!(?:10|127)(?:\\.\\d{1,3}){3})"
      "(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})"
      "(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})"
      // IP address dotted notation octets
      // excludes loopback network 0.0.0.0
      // excludes reserved space >= 224.0.0.0
      // excludes network & broadcast addresses
      // (first & last IP address of each class)
      "(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])"
      "(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}"
      "(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))"
      "|"
      // host name
      "(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)"
      // domain name
      "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*"
      // TLD identifier
      "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))"
      // TLD may end with dot
      "\\.?"
      ")"
      // port number
      "(?::\\d{2,5})?"
      // resource path
      "(?:[/?#]\\S*)?"
      "$"
    );
      
    re.setPatternOptions(QRegularExpression::MultilineOption |
                       QRegularExpression::DotMatchesEverythingOption |
                       QRegularExpression::CaseInsensitiveOption);
    
    auto match = re.match(text);
    if ( match.hasMatch()) {
      qDebug() << match.captured(0);
    } else {
      qDebug() << "Nothing found";
    }
