Multithreaded Webcrawler

arvind2111

Hi there,

I'm trying to build a multithreaded webcrawler which downloads a page's HTML and all resources it references. Given that QWebPage isn't thread-safe, what would be the best way of accomplishing this?

Things I've tried:

Having each thread start its own QApplication, but that gives me a "There can only exist one QCoreApplication instance" error.
Creating the QWebPage in the main (GUI) thread and moving it to a delegate thread, but that gives me a "QObject used from outside its own thread" error.

Any pointers/direction would be greatly appreciated!

-Arvind

giesbert

Hi,

Q(Core)Application is a process singleton. It may only exist once in a process, and should be instantiated inside main().

Why didn't you create the QWebPage inside the run method of the thread?

arvind2111

Thanks for the reply! That was my initial approach, which worked fine when I restricted the app to a single child thread; once I lifted that restriction, the app would segfault. Googling around led to these forum posts, which suggest that QtWebKit is not thread-safe and can only be instantiated in the main/GUI thread. Is this not right?

http://developer.qt.nokia.com/forums/viewthread/9035
http://developer.qt.nokia.com/forums/viewthread/3005

goetz

Using QtWebKit for a web crawler sounds like overkill for that task, at least for the definition of "web crawler" (cf. wget -r) that I usually have...
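
For illustration only, here is a rough sketch of what a wget -r style crawl could look like without any WebKit objects: a shared work queue drained by a few plain std::thread workers. The names crawl and FetchLinks are made up for this example, and the actual download step is left as a callback the caller supplies, so nothing here is Qt API. It also assumes the callback does not throw.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical callback: given a URL, return the URLs that page links to.
// A real crawler would download and parse the page here.
using FetchLinks = std::function<std::vector<std::string>(const std::string&)>;

// Visit every URL reachable from `start`, using `workers` threads.
// Returns the set of all URLs that were discovered.
std::set<std::string> crawl(const std::string& start,
                            const FetchLinks& fetch,
                            unsigned workers) {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> pending;   // URLs waiting to be fetched
    std::set<std::string> seen{start}; // URLs already queued or fetched
    unsigned active = 0;               // workers currently inside fetch()
    pending.push(start);

    auto work = [&] {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            // Sleep until there is work, or until no one can produce more.
            cv.wait(lock, [&] { return !pending.empty() || active == 0; });
            if (pending.empty()) return; // queue drained and all workers idle

            std::string url = std::move(pending.front());
            pending.pop();
            ++active;

            lock.unlock();               // do the slow part without the lock
            std::vector<std::string> links = fetch(url);
            lock.lock();

            for (const std::string& link : links)
                if (seen.insert(link).second) // true only for unseen URLs
                    pending.push(link);
            --active;
            cv.notify_all();             // new work may exist, or we are done
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) pool.emplace_back(work);
    for (std::thread& t : pool) t.join();
    return seen;
}
```

All shared state lives behind one mutex, and each worker only ever touches its own local copy of a page's links outside the lock, so no object is shared between threads while it is being used.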

KA51O

I guess it's more like this: enter a website address such as www.BigCompany.com, go through that page and all the pages it links to, and, for example, collect all the e-mail addresses.
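
As a rough sketch of that idea, assuming the page's HTML is already available as a string, the links and e-mail addresses could be pulled out with std::regex. The function names (collect, extract_emails, extract_links) are invented for this example, and the patterns are deliberately simple: they only handle double-quoted href attributes and common address shapes, and are not a substitute for a real HTML parser.

```cpp
#include <regex>
#include <set>
#include <string>

// Collect every distinct match of `pattern` in `text`, keeping the given
// capture group (0 means the whole match).
std::set<std::string> collect(const std::string& text,
                              const std::regex& pattern,
                              int group) {
    std::set<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
        out.insert((*it)[group].str());
    return out;
}

// Simplified e-mail pattern (an assumption, not a full RFC 5322 matcher).
std::set<std::string> extract_emails(const std::string& html) {
    static const std::regex re(
        R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})");
    return collect(html, re, 0);
}

// Targets of double-quoted href attributes, matched case-insensitively.
std::set<std::string> extract_links(const std::string& html) {
    static const std::regex re(R"re(href\s*=\s*"([^"]+)")re",
                               std::regex::icase);
    return collect(html, re, 1);
}
```

The links found on one page would become the next URLs to visit, while the addresses are accumulated as the result.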