Getting HTML source code of a web page



  • Hi guys. I am mainly a Python programmer. I recently finished a full program and am now trying to port it to Qt/C++. One of its main features is fetching the HTML source code of a web page, which in Python I do with:

    requests.get(url_string).text
    

    It works fine and returns a string that I can easily parse with an algorithm I wrote myself. Now I'm struggling a lot with the C++ counterpart. I'm aware of the libcurl/curlpp projects, but I could not make either of them do what I want. I also did not find anything in the CPR documentation about that specific function, and I could not work out how to install it in Qt Creator.
    So the natural choice for this task seems to be QNetworkAccessManager, but I could not get any of the code I found online to work, neither the example in the docs nor the one from StackOverflow.
    Docs:

    QNetworkRequest request;
    request.setUrl(QUrl("http://qt-project.org"));
    request.setRawHeader("User-Agent", "MyOwnBrowser 1.0");
    
    QNetworkReply *reply = manager->get(request);
    connect(reply, &QIODevice::readyRead, this, &MyClass::slotReadyRead);
    connect(reply, QOverload<QNetworkReply::NetworkError>::of(&QNetworkReply::error),
            this, &MyClass::slotError);
    connect(reply, &QNetworkReply::sslErrors,
            this, &MyClass::slotSslErrors);
    

    StackOverflow:

    void htmlGet(const QUrl &url, const std::function<void(const QString&)> &fun) {
       QScopedPointer<QNetworkAccessManager> manager(new QNetworkAccessManager);
       QNetworkReply *response = manager->get(QNetworkRequest(QUrl(url)));
       QObject::connect(response, &QNetworkReply::finished, [response, fun]{
          response->deleteLater();
          response->manager()->deleteLater();
          if (response->error() != QNetworkReply::NoError) return;
          auto const contentType =
                response->header(QNetworkRequest::ContentTypeHeader).toString();
          static QRegularExpression re("charset=([!-~]+)");
          auto const match = re.match(contentType);
          if (!match.hasMatch() || 0 != match.captured(1).compare("utf-8", Qt::CaseInsensitive)) {
             qWarning() << "Content charsets other than utf-8 are not implemented yet:" << contentType;
             return;
          }
          auto const html = QString::fromUtf8(response->readAll());
          fun(html); // do something with the data
       }) && manager.take();
    }
    
    int main(int argc, char *argv[])
    {
       QCoreApplication app(argc, argv);
       htmlGet({"http://www.google.com"}, [](const QString &body){ qDebug() << body; qApp->quit(); });
       return app.exec();
    }
    

    That last one actually worked for http://www.google.com, but printed a qt.network.ssl: QSslSocket::connectToHostEncrypted: TLS initialization failed error when I tried it with the website I actually want to use.
    Another thing I tried was using PyInstaller to build a .exe from a Python script that simply takes the URL string as an argv, saves the HTML source code in the same directory, and prints it to standard output. It worked fine when called from CMD and PowerShell, even on other computers, and behaved much like a simple C++ console program. Unfortunately, when I put GetHTML.exe inside the project's resource file and then called it using QProcess, nothing happened: the file was not written and no console was opened.
    So I came here to ask for help. I really need to finish this project within the next four days. How can I get the source code of a web page as a (Q)String, given its URL? Could I use anything from Python and still be able to run the program on other machines without installing the interpreter, or do I really need to write everything in C++?
    PS: I use Qt Creator Community with Qt 5.14.2 on Windows 10 Home. The Python version is 3.7.7 or 3.8.2.


  • Qt Champions 2019

    @allangarcia2004 said in Getting HTML source code of a web page:

    Unfortunately, when I tried to put the GetHTML.exe inside the project's resource file and then call it using QProcess, nothing happened.

    Because it can't work this way. You need to extract the executable to some location in the file system (like temp directory) and start it from there. The OS does not know anything about Qt resources and can't start executables from there.

    "QSslSocket::connectToHostEncrypted: TLS initialization failed" - looks like you're trying to access HTTPS URLs and Qt cannot find the OpenSSL libraries. Please take a look at https://doc.qt.io/qt-5/ssl.html and ship the OpenSSL libraries with your application.

