Download and regex parse an url source code

JonB · 11 May 2018, 09:55

@Gojir4
See my EDIT above.

Gojir4 · 11 May 2018, 10:18

@JonB You are right: tags without a corresponding closing tag, such as <br>, are not handled by XQuery; you get the error "Opening and ending tag mismatch".
But, depending on the input format, I think this could easily be handled by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery.
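
To illustrate the kind of replacement meant here, below is a minimal sketch that rewrites a handful of unclosed void elements into self-closing form before the text is handed to an XML-based tool. It is not the code actually used: the helper name `closeVoidTags` and the list of tags are assumptions for illustration, and it uses standard C++ `std::regex` so that it compiles without Qt. It only helps when the input format is known in advance; it does nothing about other non-XML constructs that arbitrary HTML may contain.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): turn unclosed void elements
// such as <br> or <img src="..."> into self-closing XML form (<br/>).
// Tags that already end in "/>" are left untouched.
std::string closeVoidTags(const std::string& html)
{
    // Group 1: the tag name. Group 2: optional attributes, which must not
    // end with '/', so already self-closed tags do not match.
    static const std::regex voidTag(
        R"(<(br|hr|img|input|meta|link)\b([^<>]*[^<>/])?>)",
        std::regex::icase);
    return std::regex_replace(html, voidTag, "<$1$2/>");
}
```

With this sketch, `closeVoidTags("<p>a<br>b</p>")` yields `<p>a<br/>b</p>`, which an XML parser can then accept.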

JonB · 11 May 2018, 10:36

@Gojir4

But, depending on the input format, I think this could easily be handled by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery.

And I do not think that is "easy", precisely because as I said you don't have a parser for HTML, and regular expressions are a hack which at best work "approximately" and at worst get it all wrong! That's all I was trying to warn the OP about --- it won't be robust for his random HTML pages. If it works for you/him, good luck!

Gojir4 · 11 May 2018, 10:40

@JonB said in Download and regex parse an url source code:

and regular expressions are a hack which at best work "approximately"

I don't agree with that. In my opinion, regexes are extremely powerful and work as expected if used correctly. I agree that they are not designed for parsing code, but combined with other "search algorithms", like XQuery, or simple string manipulation, they can achieve almost everything. I have been using regexes for years now, and I have never seen any "approximative" result, except, of course, when the regular expression itself was badly defined.
But, that's only my opinion.

JonB · 11 May 2018, 16:06

@Gojir4
I never said regular expressions themselves are "approximative"! Of course they work. But if you do not know/cannot correctly parse the input (HTML in this case), then what they recognise/do can be, and often is, simply faulty. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you/the OP, I don't know.

There are plenty of posts on, say, stackoverflow explaining why HTML cannot be correctly parsed/recognised via regular expressions.
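
The commented-out case can be reproduced with a small sketch. The helper name `hrefsByRegex` and the sample page are made up for illustration, and the code uses standard C++ `std::regex` so it compiles without Qt. The regular expression does exactly what it was written to do, yet it has no notion of HTML structure, so it cannot tell a live link from one inside a comment.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect every href="..." value
// with a plain regex, with no knowledge of the surrounding HTML structure.
std::vector<std::string> hrefsByRegex(const std::string& html)
{
    // Group 1 captures the text between the double quotes.
    static const std::regex href(R"re(href\s*=\s*"([^"]*)")re",
                                 std::regex::icase);
    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), href), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());
    }
    return urls;
}
```

For a page such as `<a href="http://live.example">x</a><!-- <a href="http://dead.example">y</a> -->`, this returns both URLs, including the one a browser would never show. A real HTML parser would treat the comment as a single comment node and report only the first link.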

Mr Gisa · 11 May 2018, 14:26

I solved the problem by using the myhtml library; it's fast and did the trick really nicely.

JonB · 11 May 2018, 14:26

@Mr-Gisa
Then you should make that myhtml in your post a link to wherever it is, to help others. Thanks.

Mr Gisa · 11 May 2018, 14:33

@JonB I was going to do that, but with everything I had going on I forgot. Thanks for pointing it out.

    // Needs the MyHTML API header (<myhtml/api.h> in the library's own examples)
    // plus <QString>, <QByteArray> and <QDebug>.
    QString html = "<html><head></head><body><div><span>HTML</span><a href=\"http://www.google.com\">a</a><a href=\"ohyeah.com\">b</a></div></body></html>";
    // Keep the UTF-8 bytes in a QByteArray so the buffer and its length stay together.
    const QByteArray chtml = html.toUtf8();

    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    // parse html (pass the byte count from the QByteArray instead of calling strlen)
    myhtml_parse(tree, MyENCODING_UTF_8, chtml.constData(), static_cast<size_t>(chtml.size()));

    // get the A collection
    myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);

    // print the href of every <a> node; skip the loop if the lookup returned NULL
    for(size_t i = 0; collection && i < collection->length; i++) {
        myhtml_tree_attr_t *gets_attr = myhtml_attribute_by_key(collection->list[i], "href", 4);

        if (gets_attr) {
            const char *attr_char = myhtml_attribute_value(gets_attr, NULL);
            qDebug() << attr_char;
        }
    }

    // release resources
    myhtml_collection_destroy(collection);
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);

JonB · 11 May 2018, 14:33

@Mr-Gisa
No, you misunderstand! I want to know: what is this "myhtml" thing? Is it a package? Source code? I want the hyperlink to wherever it is, so that I can look at/download it like you have done!

Mr Gisa · 11 May 2018, 14:37

MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. https://github.com/lexborisov/myhtml

Gojir4 · 11 May 2018, 10:40

@JonB OK, I see, sorry for my misunderstanding. I agree with that. My point was that if you know in advance which format you will have to parse (as with Doxygen), regex and XQuery can become a solution. Anyway, the problem has been solved :).

Mr Gisa · 11 May 2018, 16:06

That is okay, you helped a lot