Web crawler



  • How do I create a web crawler?

    Example:

    The application searches on Google, goes through all the listed sites one by one, and extracts something from each of them.



  • You might want to start with "QNetworkAccessManager":http://qt-project.org/doc/qt-4.8/QNetworkAccessManager.html
    @
    QNetworkAccessManager manager;
    QString searchWord = "Hello";
    // Build the search URL for the given term
    QString request = "http://www.google.com.ua/#hl=en&output=search&q=" + searchWord;
    manager.get(QNetworkRequest(QUrl(request))); // asynchronous GET; the reply arrives via the finished() signal
    @
    After this you have to parse the result; that's the most difficult part.
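    Purely as a sketch of that step (assuming Qt 4.8; the Crawler class, the onReply() slot and the plain search URL are placeholders, not part of the answer above), the reply could be read like this:
    @
    // Sketch: read the page HTML once the asynchronous GET has finished.
    #include <QCoreApplication>
    #include <QNetworkAccessManager>
    #include <QNetworkRequest>
    #include <QNetworkReply>
    #include <QUrl>
    #include <QDebug>

    class Crawler : public QObject
    {
        Q_OBJECT
    public slots:
        void onReply(QNetworkReply *reply)
        {
            QByteArray html = reply->readAll();   // raw HTML, ready to be parsed for links
            qDebug() << html.left(200);           // just show the beginning here
            reply->deleteLater();
            qApp->quit();
        }
    };

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        QNetworkAccessManager manager;
        Crawler crawler;
        QObject::connect(&manager, SIGNAL(finished(QNetworkReply*)),
                         &crawler, SLOT(onReply(QNetworkReply*)));

        manager.get(QNetworkRequest(QUrl("http://www.google.com.ua/search?q=Hello")));
        return app.exec();
    }

    #include "main.moc"   // needed when this file is main.cpp and is built with qmake
    @
    For a Qt 4 qmake project this also needs QT += network in the .pro file.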



  • I've done some research on this in January.

    First, it seems QNetworkAccessManager alone is no solution; it is only a good HTTP source.
    You have to put the received content into a browser-like environment. Parsing HTML is also not really trivial: there is a tag-soup implementation that would do, but some links are generated through JavaScript, so you really need to load the page in something browser-like -> QtWebKit.

    QtWebKit offers a lot of good stuff you can use for crawling, e.g. it can extract all <a> tags (i.e. the links), as in the sketch below.
    The problem here is that QtWebKit is not thread-safe, so you would have to run multiple processes doing the work in order to speed things up.
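    A minimal sketch of that link extraction, assuming Qt 4.8 with the QtWebKit module (the URL is only a placeholder), could look like this:
    @
    // Sketch: load one page in a headless QWebPage and print every <a> href it contains.
    #include <QApplication>
    #include <QWebPage>
    #include <QWebFrame>
    #include <QWebElement>
    #include <QWebElementCollection>
    #include <QUrl>
    #include <QDebug>

    int main(int argc, char *argv[])
    {
        QApplication app(argc, argv);              // QtWebKit wants a QApplication

        QWebPage page;
        // Quit the event loop once the page, including JavaScript-generated content, has loaded.
        QObject::connect(page.mainFrame(), SIGNAL(loadFinished(bool)),
                         &app, SLOT(quit()));
        page.mainFrame()->load(QUrl("http://qt-project.org/"));   // placeholder URL
        app.exec();

        // Collect all <a> tags and print their href attributes; these are the crawl targets.
        QWebElementCollection links = page.mainFrame()->findAllElements("a");
        for (int i = 0; i < links.count(); ++i)
            qDebug() << links.at(i).attribute("href");

        return 0;
    }
    @
    For a Qt 4 qmake project this needs QT += webkit in the .pro file, and the single-process restriction mentioned above still applies.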

