Extract parts from webpage
-
XML has no predefined tags, so what's wrong with having one named <script> rather than <foo> or <bar>?
Yeah, the <script> tag can have content spanning multiple lines. Yeah, and what's the problem with this? Now that I remember a bit more about what I did a few years ago, I had no problem extracting the content of a <script> tag (among others) with the browser's parser.
Well OK, a quick search shows that it was JS's DOMParser I was using, which actually handles both HTML and XML.
I had forgotten the subtle difference between DOM and XML, hence my mistake.
Qt has the QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML and not HTML ...
-
@ankou29666 said in Extract parts from webpage:
Yeah. and what's the problem with this ?
The < and & characters, at least, in the <script>/JS area, e.g.
if (a & b < c) document.write("if (a & b < c)");
-
I have made a few attempts at parsing some services with the stream parsers; sometimes it works, sometimes it fails due to unexpected content thrown in.
Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
I think it really depends on what you expect to extract.
-
The data I fetch should always be inside the same "blocks". Another example:
<div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
So the part to extract is "/images/infobox/focus-abs.jpg".
I think a regex can do this, but it didn't work.
I will look at QDomDocument to see if it can parse HTML.
-
Why don't you simply search for <img src= then?
-
@realroot said in Extract parts from webpage:
Search with QRegularExpression?
Why do you need a regexp when you want to search for a simple string?
-
@realroot
If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for the literal string <img src=" via indexOf(), find the next " after that, and the filepath is in between the two quote indexes.
For picking out the filepath with a regular expression you will want something like
<img src="([^"]*)"
The parentheses (...) allow you to capture the string inside. You have to escape the double quotes (and any backslashes) in a C++ string literal, or use raw string literals.
Even though it does not offer precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (ECMAScript (JavaScript) flavor) with bits of your input to learn how to match.
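To make the two approaches concrete, here is a minimal sketch in plain standard C++ rather than Qt, so it can be compiled on its own. The function names srcByLiteralSearch and srcByRegex are made up for illustration; std::string::find stands in for QString::indexOf(), and std::regex (ECMAScript flavor by default) stands in for QRegularExpression. Note the custom raw-string delimiter re: the pattern itself contains )" and would end a plain R"(...)" literal too early.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Literal search: locate the marker, then the next double quote after it.
// Returns false if either is missing; on success 'out' holds the path.
bool srcByLiteralSearch(const std::string& html, std::string& out) {
    const std::string marker = "<img src=\"";   // quotes escaped by hand
    const std::size_t start = html.find(marker);
    if (start == std::string::npos) return false;
    const std::size_t valueStart = start + marker.size();
    const std::size_t valueEnd = html.find('"', valueStart);
    if (valueEnd == std::string::npos) return false;
    out = html.substr(valueStart, valueEnd - valueStart);
    return true;
}

// Regex search: capture group 1 is everything between the quotes.
// R"re(...)re" is a raw string literal, so no backslash escaping is needed.
bool srcByRegex(const std::string& html, std::string& out) {
    static const std::regex pattern(R"re(<img src="([^"]*)")re");
    std::smatch match;
    if (!std::regex_search(html, match, pattern)) return false;
    out = match[1].str();
    return true;
}
```

Both functions only find the first <img src="..."> in the input and assume the attribute is written exactly like that (double quotes, src directly after img), as in the example block above. Anything looser than that is where a real HTML parser starts to pay off.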
-
The regular expression tool might be worth building and trying out to get the correct syntax to use with QRegularExpression.
-
What does your reply contain? If it's the image data, then write it directly to a file.
-
So this is binary data; just use QFile to write it to disk directly.
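As a rough illustration of "write the bytes as they are", here is a sketch in standard C++ with std::ofstream standing in for QFile. The helper name writeBytes is hypothetical and the file name is up to you. The important part is opening the file in binary mode so nothing (e.g. newline bytes) gets translated on the way to disk. With Qt the idea is the same: open a QFile for writing and write the received bytes unchanged.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Write raw bytes to 'path' unchanged; returns false if the file cannot
// be opened or the write fails. std::ios::binary disables any newline
// translation, which would corrupt image data on some platforms.
bool writeBytes(const std::string& path, const std::vector<char>& bytes) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush();  // push buffered data so a late write error is visible
    return out.good();
}
```

No text conversion should happen anywhere between receiving the data and writing it, otherwise the image will not open afterwards.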
-