Using Libcurl C++ to help scrape web data

13 Posts 3 Posters 2.4k Views
    eyllanesc
    wrote on last edited by
    #2

    @Laner107 If you already have the HTML and want to extract the contents of a tag, you need an HTML parser. For example, a quick search on GitHub turned up https://github.com/lazytiger/gumbo-query

    If you want me to help you develop some work then you can write to my email: e.yllanescucho@gmal.com.

      Laner107
      wrote on last edited by
      #3

      @eyllanesc Doesn't libcurl extract the website for me and redirect it, or do I have to use a parser to get the information out of what libcurl returns?

        eyllanesc
        wrote on last edited by
        #4

        @Laner107 Hmm, libcurl is a library that makes HTTP requests; it fetches the page but does not parse it.

          Laner107
          wrote on last edited by
          #5

          @eyllanesc So what exactly do those HTTP requests do, then?

            JonB
            wrote on last edited by JonB
            #6

            @Laner107
            libcurl lets you fetch the pages/content of the website; what you get back is the HTML of those pages. It does nothing toward parsing that content. If, as in your example, you want to examine what is inside those pages, you will need a separate HTML parser.
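To make the "you get back the HTML" part concrete, here is a minimal sketch of the piece that collects the response body. libcurl hands the body to a callback in chunks; the function below has the signature libcurl documents for `CURLOPT_WRITEFUNCTION` and appends each chunk to a `std::string`. The name `appendToString` and the choice of `std::string` as the buffer are my own assumptions for illustration, not something taken from this thread.

```cpp
#include <cstddef>
#include <string>

// Callback with the signature libcurl expects for CURLOPT_WRITEFUNCTION.
// libcurl may call it several times per response, once per received chunk;
// each chunk is size * nmemb bytes long and is not null-terminated.
// userdata is whatever pointer was registered with CURLOPT_WRITEDATA,
// assumed here to be a std::string* (hypothetical setup).
std::size_t appendToString(char* data, std::size_t size, std::size_t nmemb, void* userdata)
{
    auto* buffer = static_cast<std::string*>(userdata);
    const std::size_t bytes = size * nmemb;
    buffer->append(data, bytes);
    // Returning a different count tells libcurl to abort the transfer.
    return bytes;
}
```

In a real program this would be registered with `curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, appendToString)` and `curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html)` before calling `curl_easy_perform(curl)`, where `html` is a `std::string` you own. After the transfer, `html` holds the raw page text and nothing more; parsing is still a separate step.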

            If you are prepared to be very lazy and not at all accurate, you could use QRegularExpression to get at parts of the HTML, e.g. to answer the question you pose. This is much simpler if you are just a "hobbyist", but it is nowhere near robust; if you need to be in any way accurate, only an HTML parser will do.
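A rough sketch of the "lazy" regular-expression route, for illustration only. It uses `std::regex` from the standard library instead of QRegularExpression so that it compiles without Qt; the idea is the same. The function name `firstMarketDelta` and the `marketDelta` class name in the pattern are assumptions made for this example, and the pattern is deliberately naive.

```cpp
#include <regex>
#include <string>

// Finds the first <span> whose class attribute begins with "marketDelta"
// and stores its text in value. Returns false if nothing matches.
// Naive on purpose: it assumes double-quoted attributes, class as the
// first attribute, and plain text (no nested tags) inside the span.
bool firstMarketDelta(const std::string& html, std::string& value)
{
    static const std::regex pattern(
        R"re(<span\s+class="marketDelta[^"]*"[^>]*>([^<]*)</span>)re");
    std::smatch match;
    if (!std::regex_search(html, match, pattern)) {
        return false;
    }
    value = match[1].str();  // capture group 1 is the text between the tags
    return true;
}
```

Any change to the page markup (single quotes, a reordered attribute, an extra tag inside the span) silently breaks this, which is exactly the lack of robustness described above.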

              Laner107
              wrote on last edited by Laner107
              #7

              @JonB I appreciate the detailed response, and it's starting to make more sense! My next question: which HTML parser would you recommend to work well with libcurl in C++?

              Example: I will need to scrape from this element:
              <span class="marketDelta noChange">22.64</span>

                JonB
                wrote on last edited by
                #8

                @Laner107
                I'm afraid I have no experience of what is available. HTML is notoriously difficult to parse. The example you give is in itself very easy to parse; the problem is getting through all the surrounding code leading up to it, with JavaScript dotted through it on real web pages. Above, @eyllanesc mentioned a library on GitHub; I don't know whether it is suitable. Otherwise you can Google for an HTML parser as well as I can. It looks like Python tends to be preferred; I don't know what you'll find for C++.

                  Laner107
                  wrote on last edited by
                  #9

                  @JonB Okay, with that being said, is there any way to efficiently code a Python widget and use it in a C++ GUI?

                    JonB
                    wrote on last edited by JonB
                    #10

                    @Laner107
                    I don't think you want a Python GUI widget here. You would only want non-GUI Python code to do the parsing of the HTML string, passed in from your C++ GUI code, with the results returned to it.

                    3 possible approaches occur to me:

                    1. I know you can call C++ code from Python. I am less sure about calling Python code from C++; I think @Pablo-J-Rogina is your man for providing a link to a reference for that. It may just be the standard Python https://docs.python.org/3/extending/embedding.html?

                    2. If you do not need to call the Python parsing code frequently and can instead have it "provide the answers to the parsed code all in one go", you might write the Python parsing as a standalone Python script and invoke it from your C++ GUI via Qt's QProcess.

                    3. Is this HTML parsing the point of your whole GUI application? Have you already written lots of C++ UI code for it? By the time you have written a substantial portion of the parsing in Python, it might well be easiest to write the Qt UI itself in PySide2 or PyQt5 instead of C++; then the whole thing will be in Python and there will be no issue.
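Approach 2 boils down to "run a helper program once and read what it prints". The sketch below shows that shape using POSIX `popen` rather than QProcess, purely so the example compiles without Qt; in a Qt application QProcess would be the natural and portable choice. The function name `runAndCapture` is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Runs a shell command and returns everything it wrote to standard output.
// POSIX-only (popen/pclose); this stands in for QProcess in the sketch.
std::string runAndCapture(const std::string& command)
{
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    std::array<char, 256> chunk{};
    std::size_t n = 0;
    // Read the child's output in fixed-size chunks until end of stream.
    while ((n = std::fread(chunk.data(), 1, chunk.size(), pipe)) > 0) {
        output.append(chunk.data(), n);
    }
    pclose(pipe);
    return output;
}
```

A call might look like `runAndCapture("python3 parse_quote.py page.html")`, where `parse_quote.py` is a made-up script name that reads the saved HTML and prints the extracted value. The C++ side then only has to trim and convert that one line of text.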

                      Laner107
                      wrote on last edited by
                      #11

                      @JonB Other than parsing some stock market data, that would be it for Python. I'm trying to stick to C++ because I'm currently a comp sci student and our classes are based around C++, so it helps give me a better understanding in class as well.

                        JonB
                        wrote on last edited by
                        #12

                        @Laner107
                        As you please, but in that case perhaps you should be doing your HTML parsing in C++ not Python!

                          Laner107
                          wrote on last edited by
                          #13

                          @JonB I've actually been trying that for the last 3 days, and libcurl is giving me no luck. I have a question on Stack Overflow where someone is trying to get it working, but for some reason my libcurl is not returning any HTML from the request; here is the link to the conversation on Stack Overflow. Do you have any recommendations for parsing with C++? I've been told it's very difficult, and it has definitely posed a challenge.
