Web crawler

  • How to create a web crawler?


    The application should search on Google, go through all the listed sites one by one, and extract something from each.

  • You might want to start with "QNetworkAccessManager":http://qt-project.org/doc/qt-4.8/QNetworkAccessManager.html
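    QNetworkAccessManager manager;
    QString searchWord = "Hello";
    // The query goes after "?", not "#": a URL fragment is never sent to the server.
    QString request = "http://www.google.com.ua/search?hl=en&output=search&q=" + searchWord;
    QNetworkReply *reply = manager.get(QNetworkRequest(QUrl(request)));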
    After this you have to parse the result, which is the most difficult part.
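
    One detail about the request URL: a search term with spaces or characters like "&" has to be percent-encoded before it goes into the query string. Here is a rough sketch in plain standard C++ (no Qt), just to show the idea. The helper names percentEncode and buildSearchUrl are made up for this example, and the "/search?hl=en&q=" layout is an assumption about the search URL. In a Qt program you would normally let QUrl do the encoding instead.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: percent-encode a string so it can be used as a query value.
// Unreserved characters (letters, digits, '-', '_', '.', '~') are kept as they are;
// every other byte becomes %XX.
std::string percentEncode(const std::string &text)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: build the search URL (assumed layout) for a given search word.
std::string buildSearchUrl(const std::string &searchWord)
{
    return "http://www.google.com.ua/search?hl=en&q=" + percentEncode(searchWord);
}
```

    For example, buildSearchUrl("Qt crawler") encodes the space as %20, so the search word stays a single query value.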

  • I did some research on this in January.

    First, QNetworkAccessManager alone does not seem to be a solution, although it is a good HTTP source.
    You have to put the received content into a browser-like environment. Parsing HTML is not trivial either: a tag soup implementation would cope with the markup, but some links are generated through JavaScript, so you really need to load the page in a browser-like component, i.e. QtWebKit.

    QtWebKit offers a lot of useful functionality for crawling; for example, it can extract all <a> tags (i.e. links).
    The problem is that QtWebKit is not thread-safe, so you would have to run multiple processes to speed up the crawl.
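
    To make the JavaScript point concrete, here is a naive link extractor in plain standard C++ (no QtWebKit). This is only a sketch to show the limitation, not how QtWebKit does it: extractLinks is a made-up name, and the regex only handles double-quoted href attributes in static markup.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the href value of every <a ... href="..."> tag.
// Naive on purpose: it only sees links that are literally present in the markup,
// so anything a script generates at runtime is missed.
std::vector<std::string> extractLinks(const std::string &html)
{
    static const std::regex anchor("<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"",
                                   std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end; it != end; ++it)
        links.push_back((*it)[1].str());  // capture group 1 is the href value
    return links;
}
```

    On a page with two static links plus one written out by a script via document.write, this returns only the two static ones. That gap is the reason for rendering the page in a browser-like component first.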

