Download and regex parse an url source code
-
I was wondering, how can I download a web page source code and get all the links in it?
I have this regex here:
((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)
But now I need to know how to use it to parse the entire html code.
wrote on 11 May 2018, 07:30 last edited by JonB 5 Nov 2018, 07:35
@Mr-Gisa
Just to be clear: web page source code is written in HTML + whatever (e.g. JavaScript, etc.). Qt does not have an HTML parser. You cannot reliably parse arbitrary HTML using regular expressions, and you cannot reliably recognise which bits of a web page are really URLs with regular expressions.
What you can do is use regular expressions to "guess" which bits might be genuine URLs. You might find some spurious, "extra" ones, and you might miss some "genuine" ones. It might do better or worse on some HTML sources (e.g. Doxygen) than others.
So long as you are happy with this "approximation" that's OK. But you can never guarantee getting them all correctly.
I don't really know what @Gojir4 's code is doing. HTML is not XML, so if it's relying on HTML input being parseable as well-formed XML it will fail. XHTML is XML, and so will parse, but not many sites produce XHTML. Maybe Doxygen does, others do not.
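To make the "HTML is not XML" point concrete, here is a rough sketch of a well-formedness check. It is not Qt code and not a real XML parser: the function name looksWellFormed is made up for this illustration, and it uses std::regex from the C++ standard library only to find tags and verify that every opening tag has a matching closing tag, which is the minimum an XML parser would insist on.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns true if every opening tag is closed in
// the right order. Comments, CDATA, entities and attributes containing
// '>' are ignored, so this is only a rough stand-in for an XML parser.
bool looksWellFormed(const std::string &markup)
{
    // Group 1: "/" for a closing tag, group 2: tag name,
    // group 3: "/" directly before ">" for a self-closing tag.
    static const std::regex tagPattern(
        R"(<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>)");

    std::vector<std::string> openTags;
    for (std::sregex_iterator it(markup.begin(), markup.end(), tagPattern), end;
         it != end; ++it) {
        const std::string name = (*it)[2].str();
        const bool isClosing = (*it)[1].length() > 0;
        const bool isSelfClosing = (*it)[3].length() > 0;

        if (isSelfClosing)
            continue;                    // <br/> opens and closes itself
        if (!isClosing) {
            openTags.push_back(name);    // remember the opening tag
            continue;
        }
        if (openTags.empty() || openTags.back() != name)
            return false;                // closing tag does not match
        openTags.pop_back();
    }
    return openTags.empty();             // anything left open is an error
}
```

With this sketch, "<p>line one<br>line two</p>" is rejected because the <br> is never closed, while the XHTML spelling "<p>line one<br/>line two</p>" is accepted. Perfectly legal HTML can therefore fail an XML-style check.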
-
wrote on 11 May 2018, 08:13 last edited by
From the doc of QXmlQuery:
"QXmlQuery is typically used to query XML data, but it can also query non-XML data that has been modeled to look like XML."
And then the code example parses an HTML file:
QXmlQuery query;
query.setQuery("doc('index.html')/html/body/p[1]");
I'm a little bit confused about this right now.
-
wrote on 11 May 2018, 09:01 last edited by JonB 5 Nov 2018, 09:23
@Gojir4
Yes, note the "non-XML data that has been modeled to look like XML" and the page's further statement:
"The example uses QXmlQuery to match the first paragraph of an XML document and then output the result to a device as XML."
So (bearing in mind I know nothing about this!), what exactly does the doc('index.html') deliver? In http://doc.qt.io/qt-5/xmlprocessing.html I can see it mentions: "When Qt XML Patterns loads an XML resource, e.g., using the fn:doc() function", but I can't click on that. Where is fn:doc() documented?
EDIT
OK, fn:doc() is just an XQuery function for accessing the document object. So that assumes that you already have a parsed document. All the examples I can see anywhere other than that example access a .xml file, not a .html one, which is as I would expect.
So I assume this will only work for you if the particular HTML file you pass happens to parse as XML, i.e. it's either XHTML in the first place, or the HTML it contains does not have anything HTML-but-not-XML in it (which may be the case for some HTML documents but not others).
Try putting, say, precisely <br> (and no </br>) somewhere in your HTML and see if it still parses? An unclosed <br> is a common example of legal HTML, but is not legal in XHTML or XML...?
-
wrote on 11 May 2018, 09:55 last edited by
@JonB You are right: tags without a corresponding closing tag, such as <br>, are not handled by XQuery; you get the error "Opening and ending tag mismatch".
But, depending on the input format, I think this could be easily handled by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery.
-
wrote on 11 May 2018, 10:18 last edited by
"But, depending on the input format, I think this could be easily handled by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery."
And I do not think that is "easy", precisely because as I said you don't have a parser for HTML, and regular expressions are a hack which at best work "approximately" and at worst get it all wrong! That's all I was trying to warn the OP about --- it won't be robust for his random HTML pages. If it works for you/him, good luck!
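For reference, the kind of pre-replacement being discussed might look roughly like the sketch below. It is only an illustration under assumptions: the function name selfCloseVoidTags and the short list of tags are invented here, and it uses std::regex from the C++ standard library rather than whatever was actually used with XQuery.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrite a few unclosed "void" tags such as <br>
// into their self-closing form (<br/>) so an XML tool will accept them.
// The tag list is an assumption and deliberately incomplete.
std::string selfCloseVoidTags(const std::string &html)
{
    // Group 1: tag name. Group 2: optional attributes, which must not
    // end in "/" so that tags that are already self-closed are skipped.
    static const std::regex voidTag(
        R"(<(br|hr|img|input|meta|link)\b([^>]*[^>/])?>)",
        std::regex::icase);
    return std::regex_replace(html, voidTag, "<$1$2/>");
}
```

On controlled input this does what is wanted: "<p>a<br>b</p>" becomes "<p>a<br/>b</p>". It also shows the limits, since the replacement knows nothing about context: a "<br>" inside a script string or a comment gets rewritten too. So it can be fine when the input format is known in advance and fragile on random pages.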
-
wrote on 11 May 2018, 10:36 last edited by
@JonB said in Download and regex parse an url source code:
"and regular expressions are a hack which at best work "approximately""
I don't agree with that. In my opinion, regexes are extremely powerful and work as expected if used correctly. I agree that they are not designed for parsing code, but combined with other "search algorithms", like XQuery, or simple string manipulation, they can achieve almost everything. I have been using regexes for years now, and I have never seen any "approximative" result, except, of course, when the regular expression itself was badly defined.
But that's only my opinion.
-
wrote on 11 May 2018, 10:40 last edited by
@Gojir4
I never said regular expressions themselves are "approximative"! Of course they work. But if you do not know/cannot correctly parse the input (HTML in this case), then what they recognise/do can be, and often is, simply faulty. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you/the OP, I don't know.
There are plenty of posts on, say, Stack Overflow explaining why HTML cannot be correctly parsed/recognised via regular expressions.
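A small sketch of the commented-out case, for anyone who wants to see it happen. This is an assumption-laden illustration: findHrefs is a made-up name, the pattern only looks at double-quoted href attributes (it is not the long regex from the first post), and it uses std::regex from the C++ standard library instead of Qt's regex classes.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every href="..." value the regex can see
// in the raw text. The regex has no notion of HTML structure, so text
// inside comments or scripts looks exactly like real markup to it.
std::vector<std::string> findHrefs(const std::string &html)
{
    static const std::regex hrefPattern(R"re(href\s*=\s*"([^"]*)")re",
                                        std::regex::icase);
    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // group 1 is the quoted value
    }
    return urls;
}
```

Fed a page whose only live link is http://live.example/ but which also contains a commented-out link to http://old.example/, this returns both URLs. The regex did exactly what it was told; it just cannot know that one of the matches sits inside a comment.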
-
wrote on 11 May 2018, 13:48 last edited by
I solved the problem by using the myhtml library, it's fast and did the trick really nicely.
-
wrote on 11 May 2018, 14:31 last edited by
@JonB I was going to do that, but with so many things going on I forgot. Thanks for pointing it out.
// Needs <myhtml/api.h>, <QString>, <QByteArray> and <QDebug>.
QString html = "<html><head></head><body><div><span>HTML</span><a href=\"http://www.google.com\">a</a><a href=\"ohyeah.com\">b</a></div></body></html>";
// Keep the UTF-8 bytes alive in a QByteArray for the duration of the parse.
const QByteArray chtml = html.toUtf8();

// basic init
myhtml_t *myhtml = myhtml_create();
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

// first tree init
myhtml_tree_t *tree = myhtml_tree_create();
myhtml_tree_init(tree, myhtml);

// parse html (pass the byte count explicitly instead of using strlen)
myhtml_parse(tree, MyENCODING_UTF_8, chtml.constData(), chtml.size());

// get the A collection
myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);

// print the href of every <a> element that has one
for (size_t i = 0; collection && i < collection->length; i++) {
    myhtml_tree_attr_t *gets_attr = myhtml_attribute_by_key(collection->list[i], "href", 4);
    if (gets_attr) {
        const char *attr_char = myhtml_attribute_value(gets_attr, NULL);
        qDebug() << attr_char;
    }
}

// release resources
myhtml_collection_destroy(collection);
myhtml_tree_destroy(tree);
myhtml_destroy(myhtml);
-
wrote on 11 May 2018, 14:37 last edited by
MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. https://github.com/lexborisov/myhtml
-
wrote on 11 May 2018, 16:06 last edited by
@JonB OK, I see, sorry for my misunderstanding. I agree with that. My point was that if you know in advance which format you will have to parse (as for Doxygen), regex and XQuery can be a solution. Anyway, the problem has been solved :).
-
wrote on 11 May 2018, 16:06 last edited by
That is okay, you helped a lot