Extract parts from webpage

ankou29666

XML has no predefined tags, so what's wrong with having one named <script> rather than <foo> or <bar>?
Yeah, the <script> tag can have content over multiple lines. Yeah, and what's the problem with this? Now that I remember a bit more about what I did a few years ago, I had no problem extracting the content of a <script> tag (among others) with the browser's parser.

Well OK, a little searching shows that it was JS's DOMParser I was using, which actually handles both HTML and XML.
I had forgotten the subtle difference between DOM and XML. Hence my mistake.

Qt has the QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML and not HTML ...

JonB

@ankou29666 said in Extract parts from webpage:

Yeah. and what's the problem with this ?

< and & characters, at least, in the <script>/JS area, e.g.

if (a & b < c)
    document.write("if (a & b < c)");

Use of CDATA in program output.
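To make the point about < and & concrete: for that line of JS to sit inside well-formed XML as plain text, both characters would have to be escaped, which an ordinary HTML page does not do inside <script>. A minimal sketch in plain standard C++ (the helper name escapeForXml is made up for illustration and covers only these two characters):

```cpp
#include <string>

// Hypothetical helper: escapes only '&' and '<', the two characters
// that make raw JavaScript invalid as XML text content.
std::string escapeForXml(const std::string &text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;";  break;
        default:  out += c;       break;
        }
    }
    return out;
}
```

With this, "if (a & b < c)" would have to appear as "if (a &amp; b &lt; c)" before a strict XML parser accepts it.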

artwaw

I have made a few attempts at parsing some services with the stream parsers; sometimes it works, and sometimes it fails because of unexpected content thrown in.
Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
I think it really depends on what you expect to extract.

realroot

The data I fetch should always be inside the same "blocks". Another example:

<div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>

So I want "/images/infobox/focus-abs.jpg".
I think a regex can do this, but my attempt didn't work.

I will look at QDomDocument to see whether it can parse HTML.

Christian Ehrlicher

Why don't you simply search for <img src= then?

realroot

Search with QRegularExpression? Could you clarify?

Christian Ehrlicher

@realroot said in Extract parts from webpage:

Search with QRegularExpression?

Why do you need a regexp when you want to search for a simple string?

JonB

@realroot
If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for literal string <img src=" via indexOf(), find the next " after that, and the filepath is in-between the quote indexes.
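A minimal sketch of that search, written in plain standard C++ so it runs without Qt (with Qt, QString::indexOf() and QString::mid() would play the roles of find() and substr()). The helper name firstImgSrc is made up for illustration, and it assumes the attribute is written exactly as <img src="...">:

```cpp
#include <string>

// Hypothetical helper: returns the value of the first <img src="..."> found
// in html, or an empty string if the marker or the closing quote is missing.
std::string firstImgSrc(const std::string &html)
{
    const std::string marker = "<img src=\"";
    const std::size_t start = html.find(marker);
    if (start == std::string::npos)
        return std::string();

    // The file path starts right after the opening quote...
    const std::size_t begin = start + marker.size();
    // ...and ends just before the next quote.
    const std::size_t end = html.find('"', begin);
    if (end == std::string::npos)
        return std::string();

    return html.substr(begin, end - begin);
}
```

Called on the <div class="infobox-works"> line from earlier in the thread, this should give /images/infobox/focus-abs.jpg.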

For picking out the filepath in a regular expression you will want something like
<img src="([^"]*)"
The parentheses (...) allow you to capture the string inside. You have to escape the double quotes (and any backslashes) when you write the pattern as a C++ string literal, or use a raw string literal.
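As a rough illustration of the capture group and the raw string literal, here is a sketch using std::regex from the standard library rather than QRegularExpression, so that it runs without Qt (with Qt, QRegularExpression::match() and QRegularExpressionMatch::captured(1) would be the equivalents). The helper name firstImgSrcRegex is made up. Note that the pattern itself contains )" so the raw string needs a custom delimiter (rx here):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first captured src value, or an empty
// string when the pattern does not match anywhere in html.
std::string firstImgSrcRegex(const std::string &html)
{
    // R"rx( ... )rx" is a raw string literal with the delimiter "rx";
    // the default R"( ... )" would end early at the )" inside the pattern.
    static const std::regex pattern(R"rx(<img src="([^"]*)")rx");

    std::smatch match;
    if (std::regex_search(html, match, pattern))
        return match[1].str();   // group 1 is the text between the quotes
    return std::string();
}
```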

Even though it does not offer precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (EcmaScript (JavaScript) flavor) with bits of your input to learn how to match.

SGaist

The regular expression example tool might be worth building and testing with, to get the correct syntax to use with QRegularExpression.

realroot

I did not think about indexOf().
I can use that indeed.

To save a JPG or PDF, can I use a QPixmap?

void onFinished(QNetworkReply *reply) {
    ...
    QPixmap pm;
    pm.loadFromData(reply->readAll());

SGaist

What does your reply contain? If it's the image data, then write it directly to a file.

realroot

I still have to try it, but it should be:

QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));

So I think it is.

I use QTextStream for text, but I am not sure how to handle images.

SGaist

So this is binary data; just use QFile to write it to disk directly.
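The idea is simply to write the bytes unchanged, with no text stream or image class in between. A minimal sketch in plain standard C++ with std::ofstream standing in for QFile, so it runs without Qt (with Qt, one would open a QFile with QIODevice::WriteOnly and pass reply->readAll() to QFile::write()). The helper name saveBytes is made up for illustration:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes the bytes to path exactly as given.
// Returns false if the file cannot be opened or the write fails.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    // std::ios::binary avoids any newline translation of the data.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();           // flush to disk before reporting success
    return !out.fail();
}
```

The same function works for a JPG or a PDF, since nothing here interprets the content.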