Using Libcurl C++ to help scrape web data
-
Hello, I am having trouble scraping web data using libcurl, as I can't seem to find the correct HTML content to scrape the text that is under "react-id: 140". I know that if I choose "140" for the web scraper to find, it will return the react id but not the text behind it. How can I go about getting the text that this <span> has in it? Code and pictures below.
Finding the HTML text (the attempt)
size_t indexTCash = html.find("140");
string tCash = html.substr(indexTCash);
totalCash = stod(tCash);
Picture of html inspect that i need
-
@Laner107
libcurl will allow you to fetch the pages/content of the web site. You will get back the HTML of the pages. That does nothing toward parsing that content; if, as per your example, you want to examine what is going on inside those pages, you will need some sort of HTML parser separately.
If you are prepared to be very lazy and not at all accurate, you could use QRegularExpressions to get at parts of the HTML, e.g. to answer the question you pose. This is much simpler if you are just a "hobbyist", but it is nowhere near robust; if you need to be in any way accurate, then only an HTML parser will do. -
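As a rough illustration of the "lazy" regular-expression route, here is a minimal sketch. It uses std::regex from the standard library rather than QRegularExpression so it builds without Qt. The function name spanTextByReactId is hypothetical, and it assumes the attribute in the page is spelled data-reactid="140", which cannot be confirmed from the thread. It also assumes the span contains plain text with no nested tags.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: looks for the first <span ...> whose data-reactid
// attribute equals `id` and copies the text between the tags into `out`.
// Returns false if no such span is found.
// Assumptions: the attribute is written data-reactid="<id>", `id` contains
// no regex metacharacters (e.g. it is all digits), and the span holds plain
// text only (no nested tags).
bool spanTextByReactId(const std::string& html, const std::string& id,
                       std::string& out) {
    const std::regex pattern(
        "<span\\s[^>]*data-reactid=\"" + id + "\"[^>]*>([^<]*)</span>");
    std::smatch match;
    if (!std::regex_search(html, match, pattern)) {
        return false;
    }
    out = match[1].str();  // first capture group: the text inside the span
    return true;
}
```

The returned string can then be passed to std::stod. A pattern like this breaks as soon as the site reorders attributes in an unexpected way or nests markup inside the span, which is why it is no substitute for a real HTML parser.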
@JonB I appreciate the detailed response, and it's starting to make more sense! My next question: what HTML parser would you recommend to work well with libcurl in C++ then?
Example: I will need to scrape from this element:
<span class="marketDelta noChange">22.64</span> -
@Laner107
I'm afraid I have no experience of what is available. HTML is notoriously difficult to parse. The example you give is in itself very easy to parse; the problem is getting through all the surrounding code leading up to it, with JavaScript dotted through it in practice on real web pages. Above, @eyllanesc mentioned a library on GitHub; I don't know whether that is suitable. Otherwise you can Google for "HTML parser" as well as I can. It looks like Python tends to be preferred; I don't know what you'll find for C++. -
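To show the "very easy" part in isolation, here is a minimal sketch using only std::string searching on the example element quoted above. The helper names textAfterTag and parseMarketDelta are hypothetical, and the sketch assumes the opening tag appears in the downloaded HTML exactly as written, character for character. It does nothing about the harder problem of surrounding markup and script-generated content.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: finds the first exact occurrence of `openTag` and
// copies everything between it and the next '<' into `out`.
// Returns false if the tag or a following '<' is not found.
bool textAfterTag(const std::string& html, const std::string& openTag,
                  std::string& out) {
    const std::size_t start = html.find(openTag);
    if (start == std::string::npos) {
        return false;
    }
    const std::size_t textBegin = start + openTag.size();
    const std::size_t textEnd = html.find('<', textBegin);
    if (textEnd == std::string::npos) {
        return false;
    }
    out = html.substr(textBegin, textEnd - textBegin);
    return true;
}

// Hypothetical helper: extracts the number from the example element
// <span class="marketDelta noChange">22.64</span>.
// Assumption: the opening tag matches this literal string exactly.
bool parseMarketDelta(const std::string& html, double& value) {
    std::string text;
    if (!textAfterTag(html, "<span class=\"marketDelta noChange\">", text)) {
        return false;
    }
    try {
        value = std::stod(text);  // throws if the text is not a number
    } catch (const std::exception&) {
        return false;
    }
    return true;
}
```

Note that the search skips past the whole opening tag before reading the text. Calling substr from the position of a marker such as "140" and handing that to stod would parse the marker itself instead of the span's contents.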
@Laner107
I don't think you want a Python GUI widget here; you would only want Python non-GUI code to do the parsing of the HTML string, passed into it from your C++ GUI code and returning its results to that.
3 possible approaches occur to me:
-
I know you can call C++ code from Python. I am less sure about calling Python code from C++ --- I think @Pablo-J-Rogina is your man for providing a link to a reference for that? It may just be the standard Python https://docs.python.org/3/extending/embedding.html?
-
If you do not need to call the Python parsing code frequently and can instead have it "provide the answers to the parsed code all in one go", you might write the Python parsing in a standalone Python script and invoke it from your C++ GUI via Qt QProcess. -
Is this HTML parsing the point of your whole GUI application? Have you already written lots of C++ UI code for it? Because by the time you write a substantial portion of the parsing in Python, it might well be easiest to change your Qt UI itself to be written in PySide2 or PyQt5 instead of C++, and then the whole thing will be in Python and there will be no issue?
-
-
@JonB Other than parsing some potential stock market data, that will be it for Python, and I'm trying to stick to C++ because I'm currently a comp sci student and our classes are based around C++, so it helps give me a better understanding in class as well.
-
@JonB I've actually been trying that for the last 3 days. libcurl is giving me no luck; I have a question on Stack Overflow where a guy is attempting to get it working, but for some reason my libcurl is not returning any HTML from the request. Here is the link to the conversation on Stack Overflow. Do you have any recommendations for parsing with C++? I've been told it's very difficult, and it has definitely posed a challenge.