
Creating a web crawler type of application using available Qt libraries



  • Hello,

    I am trying to write a web crawler application in Qt that can scrape given URLs, parse the links found there, and continue up to a given depth level. I am using QNetworkAccessManager, QNetworkRequest and QNetworkReply to do so. But when I get the reply and try to use the downloaded data, it seems to be empty. I will attach the code.

    void Crawler::writeData()
    {
    	// slot for QNetworkReply::finished; sender() is the reply
    	// that emitted the signal
    	QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
    	QByteArray read = reply->readAll();
    	QFile out("currentpage.txt");
    	out.open(QIODevice::WriteOnly | QIODevice::Text);
    	out.write(read);
    	out.close();
    	reply->close();
    	reply->deleteLater();
    }
    void Crawler::searchInit()
    {
    	// take the next queued link and start an asynchronous GET
    	QUrl myurl = QUrl(m_lnkQueue.front().c_str());
    	QNetworkReply* reply = manager->get(QNetworkRequest(myurl));
    	connect(reply, &QNetworkReply::finished, this, &Crawler::writeData);
    }
    void Crawler::getData()
    {
    	std::ifstream fin("currentpage.txt");
    	char buffer[1024];
    	if (fin.is_open())
    	{
    		// test the read itself; checking eof() before reading
    		// can process the last line twice
    		while (fin.getline(buffer, 1024))
    		{
    			m_currentPage.append(buffer);
    		}
    	}
    	fin.close();
    	m_parsingPage = m_currentPage;
    }
    
    

    These are the functions that seem to cause the problem. I would appreciate any help! Thanks in advance!



  • @undac said in Creating a web crawler type of application using available Qt libraries:

    QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());

    This is not the best solution, but your code works for me.

    I've tested your writeData() and searchInit(). My file currentpage.txt wasn't empty: I used this page for testing and got 39.4 kB of data. Does your file have 0 bytes? Otherwise there is something wrong with your getData() function, which I would revise tbh. Why do you use QFile to write the file and then standard C++ streams to read it in again?
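
    If you keep Qt for both directions, getData() could look roughly like this (a sketch reusing the member names from your post, not tested against the rest of your class):

    void Crawler::getData()
    {
    	QFile in("currentpage.txt");
    	if (!in.open(QIODevice::ReadOnly | QIODevice::Text))
    		return;                              // nothing to parse
    	// readAll() also avoids the classic while(!eof()) pitfall;
    	// assuming m_currentPage is a std::string, as the
    	// .c_str()/.find() usage elsewhere suggests
    	m_currentPage = in.readAll().toStdString();
    	in.close();
    	m_parsingPage = m_currentPage;
    }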



  • I omitted to say that these functions work when used separately, but I am using them in a loop to get all the links, and it doesn't download with every iteration of the while loop.

    	while (!m_lnkQueue.empty() && m_depth <= wantedLevel)
    	{
    		this->searchInit();
    		this->getData();
    		// std::string::find() returns npos on failure, not -1
    		if (!m_currentPage.empty() && m_currentPage.find("div") != std::string::npos)
    		{
    			this->parseLinks();
    			this->saveToFile();
    		}
    		else
    		{
    			m_lnkQueue.pop_front();
    		}
    	}
    

    The other two functions shown here just modify the data and do not mess with the file.



  • And only after the while loop completes can I see the information written to the file, as if there were a delay.



  • @undac said in Creating a web crawler type of application using available Qt libraries:

    this->searchInit();
    this->getData();

    This doesn't ensure that the QNetworkReply has been written completely to your file before you access the file again to read its content. That could cause your delay.
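
    For illustration, one common way to block until a reply has finished is a local QEventLoop (a sketch, not taken from your code; needs #include <QEventLoop>):

    	QNetworkReply* reply = manager->get(QNetworkRequest(myurl));
    	QEventLoop loop;
    	connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
    	loop.exec();                 // blocks here until finished() is emitted
    	QByteArray read = reply->readAll();
    	reply->deleteLater();

    Note that exec() spins a nested event loop, so the rest of the application stays responsive while you wait.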



  • And how could I resume the loop only after the download is complete?



  • I finally solved the problem. The download is done asynchronously, so it would only complete after the loop had already finished executing. The solution was to call the functions in a chain: getData from writeData (in which I also changed the way the file is read to a Qt method), parseLinks from getData, saveToFile from parseLinks, and then back to searchInit.
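
    In outline, the chain now looks roughly like this (a sketch with the member names from the earlier posts; the queue and depth bookkeeping inside saveToFile is only indicated):

    void Crawler::writeData()
    {
    	QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
    	QFile out("currentpage.txt");
    	if (out.open(QIODevice::WriteOnly | QIODevice::Text))
    		out.write(reply->readAll());
    	out.close();
    	reply->deleteLater();
    	getData();                      // next step: read the file back
    }
    void Crawler::getData()
    {
    	QFile in("currentpage.txt");    // Qt file API now, as suggested above
    	if (in.open(QIODevice::ReadOnly | QIODevice::Text))
    		m_currentPage = in.readAll().toStdString();
    	parseLinks();                   // parseLinks() then calls saveToFile()
    }
    void Crawler::saveToFile()
    {
    	// ...write the results, then advance the queue and depth...
    	m_lnkQueue.pop_front();
    	if (!m_lnkQueue.empty() && m_depth <= wantedLevel)
    		searchInit();               // re-enters the chain asynchronously
    }

    This way each step runs only after the previous one has actually finished, so no blocking loop is needed.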

