Multithreaded Webcrawler



  • Hi there,

    I'm trying to build a multithreaded web crawler that downloads a page's HTML and all the resources it references. Given that QWebPage isn't thread-safe, what would be the best way of accomplishing this?

    Things I've tried:

    • Have each thread start its own QApplication, but that gives me a "There can only exist one QCoreApplication instance" error.
    • Creating the QWebPage in the main (GUI) thread and moving it to a delegate thread, but that gives me a "QObject used from outside its own thread" error.

    Any pointers/direction would be greatly appreciated!

    -Arvind



  • Hi,

    Q(Core)Application is a process singleton. It may exist only once in a process, and it should be instantiated inside main().

    Why didn't you create the QWebPage inside the run method of the thread?



  • Thanks for the reply! That was my initial approach, which worked fine when I restricted the app to only one child thread but, on lifting this restriction, the app would segfault. Googling around led to these forum posts, which suggested that QtWebKit is not thread-safe and can only be instantiated in the main/GUI thread. Is this not right?

    http://developer.qt.nokia.com/forums/viewthread/9035
    http://developer.qt.nokia.com/forums/viewthread/3005



  • Using QtWebKit for a web crawler sounds like overkill for that task. At least for the definition of "web crawler" (cf. wget -r) that I usually have...



  • I guess it's more like this: enter a website address such as www.BigCompany.com, go through that page and all the pages it links to, and, for example, collect all the e-mail addresses.
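
    To make that kind of crawl concrete, here is a rough sketch in plain standard C++ (no Qt, no networking). Everything in it is an assumption made for illustration: PageStore is a hypothetical in-memory map standing in for already-downloaded HTML, and extractEmails, extractLinks and crawlEmails are made-up helper names. The regular expressions are deliberately simple and will miss or mis-match real-world markup. The crawl proceeds level by level: each level's pages are split across worker threads, and the shared result sets are only touched under a mutex.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <regex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for downloaded pages: URL -> HTML text.
using PageStore = std::map<std::string, std::string>;

// Return every e-mail-like token in `html` (simplified pattern, not RFC 5322).
std::set<std::string> extractEmails(const std::string& html) {
    static const std::regex pattern(R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})");
    std::set<std::string> found;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end; it != end; ++it)
        found.insert(it->str());
    return found;
}

// Return the targets of double-quoted href attributes in `html`.
std::set<std::string> extractLinks(const std::string& html) {
    static const std::regex pattern(R"rx(href="([^"]+)")rx");
    std::set<std::string> found;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end; it != end; ++it)
        found.insert((*it)[1].str());
    return found;
}

// Crawl from `startUrl`, following links level by level, and collect e-mails.
std::set<std::string> crawlEmails(const PageStore& pages, const std::string& startUrl,
                                  unsigned threadCount = 4) {
    if (threadCount == 0) threadCount = 1;
    std::set<std::string> emails;
    std::set<std::string> visited{startUrl};
    std::vector<std::string> frontier{startUrl};
    std::mutex mutex;  // guards emails, visited and next

    while (!frontier.empty()) {
        std::vector<std::string> next;
        // Worker number `first` handles frontier[first], frontier[first + threadCount], ...
        auto worker = [&](std::size_t first) {
            for (std::size_t i = first; i < frontier.size(); i += threadCount) {
                auto page = pages.find(frontier[i]);
                if (page == pages.end()) continue;  // URL we have no HTML for
                // Parsing happens outside the lock; only the merge is serialized.
                auto pageEmails = extractEmails(page->second);
                auto pageLinks = extractLinks(page->second);
                std::lock_guard<std::mutex> lock(mutex);
                emails.insert(pageEmails.begin(), pageEmails.end());
                for (const auto& link : pageLinks)
                    if (visited.insert(link).second) next.push_back(link);
            }
        };
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < threadCount; ++t) workers.emplace_back(worker, t);
        for (auto& w : workers) w.join();
        frontier = std::move(next);  // safe: all workers have finished
    }
    return emails;
}
```

    Calling crawlEmails(pages, "http://www.BigCompany.com/") on a filled PageStore would return the set of addresses found on every page reachable from the start URL. A real crawler would replace the map lookup with an actual download step, which is where the threading restrictions discussed above come in.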

