Creating a web crawler type of application using available Qt libraries
-
wrote on 30 Apr 2020, 12:09 last edited by
Hello,
I am trying to write a web crawler application in Qt that can scrape given URLs, parse the links found there, and continue up to a given depth level. I am using QNetworkAccessManager, QNetworkRequest, and QNetworkReply to do so. But when I get the reply and try to use the data I downloaded, it seems to be empty. I will attach the code.
    void Crawler::writeData()
    {
        QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
        QByteArray read = reply->readAll();
        QFile out("currentpage.txt");
        out.open(QIODevice::WriteOnly | QIODevice::Text);
        out.write(read);
        out.close();
        reply->close();
        reply->deleteLater();
    }

    void Crawler::searchInit()
    {
        QUrl myurl = QUrl(m_lnkQueue.front().c_str());
        QNetworkReply* reply = manager->get(QNetworkRequest(myurl));
        connect(reply, &QNetworkReply::finished, this, &Crawler::writeData);
    }

    void Crawler::getData()
    {
        std::ifstream fin("currentpage.txt");
        char buffer[1024];
        if (fin.is_open()) {
            while (!fin.eof()) {
                fin.getline(buffer, 1024);
                m_currentPage.append(buffer);
            }
        }
        fin.close();
        m_parsingPage = m_currentPage;
    }
These are the functions that seem to cause the problem. I would appreciate any help! Thanks in advance!
-
wrote on 30 Apr 2020, 13:58 last edited by Pl45m4

@undac said in Creating a web crawler type of application using available Qt libraries:
QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
This is not the best solution, but your code works for me.

I've tested your writeData() and searchInit(). My file currentpage.txt wasn't empty. I used this page for testing and got 39.4 kB of data. Does your file have 0 bytes? Otherwise there is something wrong with your getData() function, which I would revise, tbh. Why do you use QFile/QDir to write the file and then standard C++ streams to read it back in?

-
wrote on 30 Apr 2020, 14:52 last edited by
I omitted to say that these functions work separately, but I am using them in a loop to get all the links, and it doesn't download them on every iteration of the while loop.
    while (!m_lnkQueue.empty() && m_depth <= wantedLevel) {
        this->searchInit();
        this->getData();
        if (m_currentPage != "" && m_currentPage.find("div") != -1) {
            this->parseLinks();
            this->saveToFile();
        } else {
            m_lnkQueue.pop_front();
        }
    }
The other two functions shown here only modify the data and do not touch the file.
-
wrote on 30 Apr 2020, 14:56 last edited by
And only after the while loop completes can I see that it writes the information to the file, as if there were a delay.
-
wrote on 30 Apr 2020, 15:30 last edited by

@undac said in Creating a web crawler type of application using available Qt libraries:
this->searchInit();
this->getData();

This doesn't ensure that your QNetworkReply was written completely to your file before accessing the file again to read its content. So this could cause your delay.
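One way to make the loop wait for each download would be to spin a local QEventLoop until finished() is emitted (a minimal sketch, reusing the manager and myurl names from your searchInit()):

    // Sketch: block until the reply's finished() signal fires, so the
    // file is fully written before it is read back in.
    QNetworkReply* reply = manager->get(QNetworkRequest(myurl));
    QEventLoop loop;
    QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
    loop.exec();   // returns once finished() has been emitted
    // ... now it is safe to readAll(), write the file, and read it back

Re-entering the event loop like this has its own pitfalls, though; chaining the steps via signals is usually the cleaner design.

-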
wrote on 30 Apr 2020, 15:37 last edited by
And how could I resume the loop only after the download is complete?
-
wrote on 1 May 2020, 07:34 last edited by
I finally solved the problem. The download is done asynchronously, so it would only complete after the loop had finished executing. The solution was to call the functions in a chain: getData from writeData (in which I changed the way the file is read to a Qt method), parseLinks from getData, saveToFile from parseLinks, and then back to searchInit.
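For reference, a minimal sketch of that chain (function and member names taken from the snippets above; the depth and queue bookkeeping is left out):

    // Each step is triggered by the previous one, so nothing runs
    // before the download has actually finished.
    void Crawler::searchInit()
    {
        QUrl myurl = QUrl(m_lnkQueue.front().c_str());
        QNetworkReply* reply = manager->get(QNetworkRequest(myurl));
        connect(reply, &QNetworkReply::finished, this, &Crawler::writeData);
    }

    void Crawler::writeData()
    {
        QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
        QFile out("currentpage.txt");
        if (out.open(QIODevice::WriteOnly | QIODevice::Text))
            out.write(reply->readAll());
        out.close();
        reply->deleteLater();
        getData();                        // next link in the chain
    }

    void Crawler::getData()
    {
        QFile in("currentpage.txt");
        if (in.open(QIODevice::ReadOnly | QIODevice::Text))
            m_currentPage = in.readAll().toStdString();   // Qt-based read
        in.close();
        m_parsingPage = m_currentPage;
        parseLinks();                     // parseLinks() -> saveToFile()
                                          // -> searchInit() again, until
                                          // the queue is empty or the
                                          // wanted depth is reached
    }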