Extract parts from webpage

realroot · wrote on 20 Oct 2024, 19:50

I can read the webpage as html:

if (reply->error() == QNetworkReply::NoError) {
   QString htmlContent = reply->readAll();
}

I'd like to extract some images and text from the HTML, or get them some other way:


<!DOCTYPE html>
<html lang="en-gb">
<head>
[...]
<div class="infobox">
<div class="infobox-map"><img src="/images/things/map/code-of-abs.jpg" alt="SITE Things: What it Works"></div>
<div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
<div class="infobox-focus"><img src="/images/infobox/type-abs.jpg" alt=""></div>
<div class="infobox-difficulty"><img src="/images/infobox/difficulty-5.jpg" alt=""></div>
<div class="infomore">
<div class="infotext">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at orci nibh. Phasellus quis risus rhoncus, pellentesque elit vel, egestas ligula. Ut vehicula tellus quis sem consequat vehicula. Ut convallis eget odio sed mollis. Ut faucibus felis sed laoreet mollis. Nam lobortis cursus dignissim .</div>
<div class="infoec"><strong>Extra Credit:</strong> 30 seconds rest between sets. </div>
</div>

I searched if there is some way to do this but I did not find anything.
To start I tried to extract the infotext section with:

const QRegularExpression MyClass::descrRegex("<div class=\\\"infotext\\\">(.*?)</div>", QRegularExpression::DotMatchesEverythingOption | QRegularExpression::MultilineOption);

But I do not get any result.
I do not know much about regex, so I may have done something wrong.

ankou29666 · 21 Oct 2024, 08:00

Hi

My bet would be parsing the document with QXmlStreamReader and QXmlStreamAttributes. If you are using Qt 6, be careful when browsing the documentation that you are viewing the classes for Qt 6 and not the Qt 5 compatibility module, as the classes are completely different between 5 and 6.

I'm not absolutely certain this is going to work with Qt, but in the browser/JS I've always seen HTML parsing done with XML parsers, HTML being more or less supposed to be XML, or at least XML-compliant.

You would extract the objects, for example <img ...> tags, and you should find some classes with methods to extract the values for any key you would like.

JonB · wrote on 21 Oct 2024, 08:00

@ankou29666
I would be "surprised" if a (decent) XML parser could parse HTML. I have not seen HTML parsing done with XML parsers, HTML is not supposed to be XML, and it is not "XML compliant". IIRC there is a standard called XHTML which is or claims to be XML-compliant, but I don't recall how I fared with it years ago, and the source has to be written as XHTML.

Most "real" HTML pages are sprinkled with JS blocks and similar and I don't see how XML would deal with that. HTML commonly uses standalone constructs like <br> or <hr> without any closing tag. And I believe you can have/editors can produce <b><i> ... </b></i> which is accepted in HTML. Posts like https://stackoverflow.com/questions/32572928/parsing-an-html-document-using-an-xml-parser discuss this and mention that you really need a third-party library to create a web-crawler application. There may be more up-to-date answers. I should be surprised if the Qt XML parser copes with HTML.

Of course I may be wrong and it does no harm to try, but don't hold your breath :)

If all you want is, say, to recognise <img src="..."> tags and extract a path from them AND you are fault-tolerant (you don't mind missing some images, and you don't mind picking up something as an image which isn't one, or which is commented out, or whatever), then you may get some limited way with regular expression parsing.

ankou29666 · 21 Oct 2024, 09:54

I don't remember where I've seen this, and I'm even wondering whether I might have done this myself in the past. But yeah, I think I'm mixing up HTML and XHTML; I'm probably talking about the latter.
And yeah, you're right about the standalone tags that don't have a closing tag in HTML.

What are you talking about with "JS blocks and similar" ? <script></script> ? I see no problem with this.

JonB · wrote on 21 Oct 2024, 09:54

@ankou29666 said in Extract parts from webpage:

What are you talking about with "JS blocks and similar" ? <script></script> ? I see no problem with this.

:) In between <script> and </script> you can have essentially anything over many lines. Probably anything you like except </script>. An HTML parser must know to do something like allow anything inside <script>. Where does XML have a tag like <script> which allows absolutely any arbitrary non-XML/HTML (not to mention what would it do to put that into an XML tree), and just pick up again at </script>? :)

Btw, XML has <![CDATA[ ... ]]> for arbitrary text insert, but that's very different syntax from <script> ... </script>.

ankou29666 · 21 Oct 2024, 10:55

XML has no predefined tags, so what's wrong in having one named <script> rather than <foo> or <bar> ?
Yeah the <script> tag can have content over multiple lines. Yeah. and what's the problem with this ? Now I remember a bit more about what i've done a few years ago, I had no problem extracting content of a <script> tag (among others) from browser's parser.

Well ok, a little search makes me find out that it's JS's DOMParser I was using, which actually handles both HTML and XML.
I had forgotten the subtle nuance between DOM and XML. Thus my mistake.

Qt has a QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML and not HTML ...

JonB · wrote on 21 Oct 2024, 10:55

@ankou29666 said in Extract parts from webpage:

Yeah. and what's the problem with this ?

< and & characters, at least, in the <script>/JS area, e.g.

if (a & b < c)
    document.write("if (a & b < c)");

Use of CDATA in program output.

artwaw · wrote on 21 Oct 2024, 14:11

I have made a few attempts at parsing some services with the stream parsers: sometimes it works, sometimes it fails due to unexpected content thrown in.
Sometimes I found it better to load the page into QTextDocument and use QTextBlock/QTextCursor approach.
I think it really depends what you expect to extract.

realroot · wrote on 21 Oct 2024, 15:56

The data I fetch should always be inside the same "blocks". Another example:

<div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>

So "/images/infobox/focus-abs.jpg".
I think that regex can do this but it didn't work.

I will look at QDomDocument to see whether it can parse HTML.

Christian Ehrlicher · wrote on 21 Oct 2024, 15:58

Why don't you simply search for <img src= then?

realroot · 21 Oct 2024, 17:14

Search with QRegularExpression? Could you clarify?

Christian Ehrlicher · wrote on 21 Oct 2024, 17:14

@realroot said in Extract parts from webpage:

Search with QRegularExpression?

Why do you need a regexp when you want to search for a simple string?

JonB · 21 Oct 2024, 19:04

@realroot
If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for literal string <img src=" via indexOf(), find the next " after that, and the filepath is in-between the quote indexes.
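
A minimal sketch of that search. It is written with std::string::find() so that it compiles without Qt; QString::indexOf() and QString::mid() play the same roles on a QString. The helper name extractImageSources is made up for illustration, and it assumes the attribute is written exactly as <img src=" with double quotes, as in the sample page.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect the value of every
// <img src="..."> by plain substring search, no regular expression involved.
std::vector<std::string> extractImageSources(const std::string &html)
{
    const std::string marker = "<img src=\"";
    std::vector<std::string> sources;
    std::size_t pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        const std::size_t start = pos + marker.size();  // first character after the opening quote
        const std::size_t end = html.find('"', start);  // the next quote closes the attribute
        if (end == std::string::npos)
            break;                                      // unterminated attribute: stop searching
        sources.push_back(html.substr(start, end - start));
        pos = end + 1;                                  // carry on after this attribute
    }
    return sources;
}
```

Like any plain text search, this also picks up tags that are commented out and misses tags where another attribute comes before src.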

For picking out the filepath in a regular expression you will want something like
<img src="([^"]*)"
The parentheses (...) allow you to capture the string inside. In a C++ string literal you have to escape the double quotes (and any backslashes), or use a raw string literal.
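
A minimal sketch of that pattern as a raw string literal. It uses std::regex so that it compiles without Qt; with Qt the same pattern text can be handed to QRegularExpression and the group read with QRegularExpressionMatch::captured(1). The helper name matchImageSources is made up for illustration. Note that the pattern itself contains )" so a plain R"(...)" would end too early, hence the custom delimiter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): return the captured src value
// of every <img src="..."> match in the given HTML text.
std::vector<std::string> matchImageSources(const std::string &html)
{
    // "re" is an arbitrary raw-string delimiter; it is needed because the
    // pattern contains the sequence )" which would close a plain R"( ... )".
    static const std::regex imgSrc(R"re(<img src="([^"]*)")re");

    std::vector<std::string> sources;
    for (std::sregex_iterator it(html.begin(), html.end(), imgSrc), end; it != end; ++it)
        sources.push_back((*it)[1].str());  // group 1 is the text inside the parentheses
    return sources;
}
```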

Even though it does not offer precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (EcmaScript (JavaScript) flavor) with bits of your input to learn how to match.

SGaist · 21 Oct 2024, 18:20

The regular expression tool might be worth building and trying, to work out the correct syntax to use with QRegularExpression.

realroot · wrote on 22 Oct 2024, 17:48

I did not think about indexOf().
I can use that indeed.

To save jpg or pdf can I use a QPixmap?

void onFinished(QNetworkReply *reply) {
    ...
    QPixmap pm;
    pm.loadFromData(reply->readAll());
}

SGaist · wrote on 22 Oct 2024, 18:51

What does your reply contain ? If it's the image data, then write it directly to a file.

realroot · wrote on 22 Oct 2024, 19:42

I still have to try it, but the request should be:

QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));

So I think it is the image data.

I use QTextStream for text, but I am not sure how to handle images.

SGaist · wrote on 23 Oct 2024, 05:34

So this is binary data: just use QFile to write it to disk directly.
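
A minimal sketch of writing the bytes out untouched. It uses std::ofstream in binary mode so that it compiles without Qt; with Qt the equivalent would be a QFile opened with QIODevice::WriteOnly, followed by write(reply->readAll()). The helper name saveBytes and the byte container are assumptions made for illustration.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): write raw bytes to a file
// without any text conversion, so image or PDF data stays intact.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;  // the file could not be opened for writing
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out.good(); // false if the write itself failed
}
```

No QPixmap or text stream is involved: decoding the image is only needed if you want to display or convert it, not to save it.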