Download and regex-parse a URL's source code
-
I solved the problem by using the myhtml library. It's fast and did the trick really nicely.
-
wrote on 11 May 2018, 14:31 last edited by
@JonB I was going to do that, but with so much else going on I forgot. Thanks for pointing it out.
    // requires <myhtml/api.h> and <QDebug>
    QString html = "<html><head></head><body><div><span>HTML</span>"
                   "<a href=\"http://www.google.com\">a</a>"
                   "<a href=\"ohyeah.com\">b</a></div></body></html>";
    // keep the UTF-8 bytes alive in a QByteArray for the duration of the parse
    QByteArray chtml = html.toUtf8();

    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    // parse html (pass the byte count from QByteArray instead of calling strlen)
    myhtml_parse(tree, MyENCODING_UTF_8, chtml.constData(), chtml.size());

    // get the A collection
    myhtml_collection_t* collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);

    // the collection pointer is checked because the lookup can fail
    for (size_t i = 0; collection && i < collection->length; i++) {
        // 4 is the length of the key "href"
        myhtml_tree_attr_t* gets_attr = myhtml_attribute_by_key(collection->list[i], "href", 4);
        if (gets_attr) {
            const char* attr_char = myhtml_attribute_value(gets_attr, NULL);
            qDebug() << attr_char;
        }
    }

    // release resources
    myhtml_collection_destroy(collection);
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
-
wrote on 11 May 2018, 14:37 last edited by
MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. https://github.com/lexborisov/myhtml
-
@Gojir4
I never said regular expressions themselves are "approximative"! Of course they work. But if you do not know how to, or cannot, correctly parse the input (HTML in this case), then what they recognise can be, and often is, simply faulty. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you or the OP, I don't know. There are plenty of posts on, say, Stack Overflow explaining why HTML cannot be correctly parsed or recognised via regular expressions.
wrote on 11 May 2018, 16:06 last edited by
@JonB OK, I see, sorry for my misunderstanding. I agree with that. My point was that if you know in advance which format you will have to parse (as with Doxygen), regex and XQuery can be a solution. Anyway, the problem has been solved :).
-
wrote on 11 May 2018, 16:06 last edited by
That is okay, you helped a lot