Creating a web crawler type of application using available Qt libraries

undac wrote on 30 Apr 2020, 12:09 (#1)

    Hello,

    I am trying to write a web crawler application in Qt that can scrape given URLs, parse the links found there, and continue up to a given depth level. I am using QNetworkAccessManager, QNetworkRequest and QNetworkReply to do so. But when I get the reply and try to use the data I downloaded, it seems to be empty. I will attach the code.

    void Crawler::writeData()
    {
    	QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());
    	QByteArray read = reply->readAll();
    	QFile out("currentpage.txt");
    	out.open(QIODevice::WriteOnly | QIODevice::Text);
    	out.write(read);
    	out.close();
    	reply->close();
    	reply->deleteLater();
    }
    void Crawler::searchInit()
    {
    	QUrl myurl = QUrl(m_lnkQueue.front().c_str());
    	QNetworkReply* reply = manager->get(QNetworkRequest(QUrl(myurl)));
    	connect(reply, &QNetworkReply::finished, this, &Crawler::writeData);
    	
    }
    void Crawler::getData()
    {
    	std::ifstream fin("currentpage.txt");
    	char buffer[1024];
    	if (fin.is_open())
    	{
    		while (!fin.eof())
    		{
    			fin.getline(buffer, 1024);
    			m_currentPage.append(buffer);
    		}
    	}
    	fin.close();
    	m_parsingPage = m_currentPage;
    }
    
    

    These are the functions which seem to cause the problem. I would appreciate any help! Thanks in advance!

Pl45m4 wrote on 30 Apr 2020, 13:58 (#2)

      @undac said in Creating a web crawler type of application using available Qt libraries:

      QNetworkReply* reply = qobject_cast<QNetworkReply*>(sender());

      This is not the best solution, but your code works for me.

      I've tested your writeData() and searchInit(). My file currentpage.txt wasn't empty. I used this page for testing and got 39.4 kB of data. Does your file have 0 bytes? Otherwise there is something wrong with your getData() function, which I would revise, to be honest. Why do you use QFile to write the file and then standard C++ streams to read it back in?
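      For comparison, reading the file back with QFile directly could look roughly like this. It is only a minimal sketch: it assumes m_currentPage and m_parsingPage are std::string, since their types are not shown in the posted code.

          // Hypothetical Qt-only rewrite of Crawler::getData(); needs #include <QFile>.
          // Assumes m_currentPage and m_parsingPage are std::string.
          void Crawler::getData()
          {
              QFile in("currentpage.txt");
              if (!in.open(QIODevice::ReadOnly | QIODevice::Text))
                  return;                              // nothing has been written yet
              const QByteArray content = in.readAll(); // read the whole page at once
              m_currentPage = content.toStdString();
              m_parsingPage = m_currentPage;
          }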


undac wrote on 30 Apr 2020, 14:52 (#3)

        I omitted to say that these functions work when used separately, but I am using them in a loop to collect all the links, and the download does not happen on every iteration of the while loop.

        	while (!m_lnkQueue.empty() && m_depth <= wantedLevel)
        	{
        		this->searchInit();
        		this->getData();
        		if (m_currentPage!= "" && m_currentPage.find("div") != -1)
        		{
        			this->parseLinks();
        
        			this->saveToFile();
        		}
        		else
        		{
        			m_lnkQueue.pop_front();
        		}
        
        	}
        

        The other two functions shown here just modify the data and do not touch the file.

undac wrote on 30 Apr 2020, 14:56 (#4)

          And only after the while loop completes can I see that the information was written to the file, as if there were a delay.

Pl45m4 wrote on 30 Apr 2020, 15:30 (#5)

            @undac said in Creating a web crawler type of application using available Qt libraries:

            this->searchInit();
            this->getData();

            This doesn't ensure that your QNetworkReply has been completely written to your file before you access the file again to read its content. That could be the cause of your delay.


undac wrote on 30 Apr 2020, 15:37 (#6)

              And how could I resume the loop only after the download is complete?

undac wrote on 1 May 2020, 07:34 (#7)

                I finally solved the problem. The download is done asynchronously, so it only completes after the loop has already run. The solution was to chain the functions: call getData from writeData (where I changed the file reading to a Qt method), parseLinks from getData, saveToFile from parseLinks, and then go back to searchInit.
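                In code, the chained structure might look roughly like this. It is only a sketch of the idea: the parseLinks() and saveToFile() bodies, the queue and depth bookkeeping, and the exact member types are not shown in the thread, so they are assumed here.

                    // Sketch of the signal-driven chain described above.
                    // Needs <QFile>, <QNetworkReply> and <QNetworkRequest>; member types,
                    // the queue handling and wantedLevel are assumptions based on the posted loop.
                    void Crawler::searchInit()
                    {
                        if (m_lnkQueue.empty() || m_depth > wantedLevel)
                            return;                                          // crawl finished
                        const QUrl url(QString::fromStdString(m_lnkQueue.front()));
                        m_lnkQueue.pop_front();
                        QNetworkReply* reply = manager->get(QNetworkRequest(url));
                        connect(reply, &QNetworkReply::finished, this, &Crawler::writeData);
                    }

                    void Crawler::writeData()
                    {
                        auto* reply = qobject_cast<QNetworkReply*>(sender());
                        if (!reply)
                            return;
                        QFile out("currentpage.txt");
                        if (out.open(QIODevice::WriteOnly | QIODevice::Text))
                            out.write(reply->readAll());
                        reply->deleteLater();
                        getData();      // the reply has finished here, so the file is complete
                    }

                    void Crawler::getData()
                    {
                        // ... read currentpage.txt back in with a Qt method, as described ...
                        parseLinks();   // parseLinks() calls saveToFile(), and saveToFile()
                                        // calls searchInit() again for the next queued URL
                    }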
