Download and regex parse an url source code

    Gojir4 wrote (#12):

    From the QXmlQuery documentation:
    "QXmlQuery is typically used to query XML data, but it can also query non-XML data that has been modeled to look like XML."

    and then the code example parses an HTML file:

      QXmlQuery query;
      query.setQuery("doc('index.html')/html/body/p[1]");
    

    I'm a little bit confused about this right now.

      JonB wrote (#13, edited):

      @Gojir4
      Yes, note the

      "non-XML data that has been modeled to look like XML"

      and, further down the same page:

      "The example uses QXmlQuery to match the first paragraph of an XML document and then output the result to a device as XML."

      So (bearing in mind I know nothing about this!), what exactly does doc('index.html') deliver? In http://doc.qt.io/qt-5/xmlprocessing.html I can see it mentions:

      "When Qt XML Patterns loads an XML resource, e.g., using the fn:doc() function"

      but I can't click on that. Where is fn:doc() documented?

      EDIT
      OK, fn:doc() is just a standard XQuery/XPath function that retrieves the document at the given URI and returns its document node.

      So that assumes the resource can be parsed as XML. All the examples I can see anywhere, other than that one, access a .xml file, not a .html one, which is what I would expect.

      So I assume this will only work for you if the particular HTML file you pass happens to parse as XML, i.e. it is either XHTML in the first place, or it contains nothing that is valid HTML but not well-formed XML (which may be the case for some HTML documents but not others).

      Try putting, say, a bare <br> (with no closing </br>) somewhere in your HTML and see if it still parses. An unclosed <br> is a common example of legal HTML that is not well-formed XHTML or XML.
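
To make the <br> point concrete, here is a minimal sketch in standard C++ of the kind of nesting check that an XML parser enforces and an HTML parser does not. It is a toy under stated assumptions, not a real XML parser and not what QXmlQuery does internally; the helper name firstTagMismatch is made up for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy check of XML-style tag nesting (hypothetical helper, illustration only).
// It is NOT an XML parser: comments, CDATA, entities and '>' inside
// attribute values are not handled.
// Returns "" if every opened element is closed in order; otherwise the name
// of the element that was still open (or of a stray closing tag) at the
// first nesting error.
std::string firstTagMismatch(const std::string& text) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = text.find('<', pos)) != std::string::npos) {
        const std::size_t end = text.find('>', pos);
        if (end == std::string::npos) return "<";  // unterminated tag
        const std::string tag = text.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        // Skip empty tags, declarations such as <!DOCTYPE ...>, and <?...?>.
        if (tag.empty() || tag[0] == '!' || tag[0] == '?') continue;

        const bool closing = tag[0] == '/';
        const bool selfClosing = tag.back() == '/';

        // The element name is the leading run of name characters.
        const std::size_t first = closing ? 1 : 0;
        std::size_t last = first;
        while (last < tag.size() &&
               (std::isalnum(static_cast<unsigned char>(tag[last])) ||
                tag[last] == '-' || tag[last] == '_' || tag[last] == ':')) {
            ++last;
        }
        const std::string name = tag.substr(first, last - first);

        if (closing) {
            if (open.empty()) return name;                // nothing to close
            if (open.back() != name) return open.back();  // wrong element closed
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    return open.empty() ? std::string() : open.back();
}
```

With this sketch, firstTagMismatch("<p>a<br>b</p>") reports "br" because </p> arrives while br is still open, whereas the same text written with <br/> passes.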

        Gojir4 wrote (#14):

        @JonB I think fn:doc is part of the XQuery/XPath specification.

          JonB wrote (#15):

          @Gojir4
          See my EDIT above.

            Gojir4 wrote (#16):

            @JonB You are right: tags without a corresponding closing tag, such as <br>, are not handled by XQuery; you get the error "Opening and ending tag mismatch".
            But, depending on the input format, I think this could easily be handled by making some replacements in the HTML before evaluating it with XQuery. That's what I did when I used XQuery.
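
A minimal sketch of that kind of pre-processing, in standard C++ with std::regex: it rewrites a few void elements such as <br> into self-closed form so that an XML parser will accept them. The function name selfCloseVoidTags and the list of tags are assumptions for illustration, not the code actually used here.

```cpp
#include <regex>
#include <string>

// Rewrites a handful of HTML void elements (<br>, <img ...>, ...) into
// self-closed XML form (<br/>, <img .../>). Hypothetical helper: the tag
// list is illustrative and incomplete.
// Caveat: this is purely textual, so it also rewrites matches that sit
// inside comments, scripts or attribute values.
std::string selfCloseVoidTags(const std::string& html) {
    static const std::regex voidTag(
        R"(<(br|hr|img|input|meta|link)\b([^>]*?)\s*/?>)",
        std::regex::ECMAScript | std::regex::icase);
    // $1 is the tag name, $2 its attributes (if any).
    return std::regex_replace(html, voidTag, "<$1$2/>");
}
```

For example, selfCloseVoidTags("<p>a<br>b</p>") yields "<p>a<br/>b</p>", and input that is already self-closed is left as it was. This only holds up when the input format is known in advance.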

              JonB wrote (#17):

              @Gojir4

              "But, depending on the input format, I think this could easily be handled by making some replacements in the HTML before evaluating it with XQuery. That's what I did when I used XQuery."

              And I do not think that is "easy", precisely because, as I said, you don't have a parser for HTML, and regular expressions are a hack which at best works "approximately" and at worst gets it all wrong! That's all I was trying to warn the OP about: it won't be robust for his random HTML pages. If it works for you/him, good luck!

                Gojir4 wrote (#18):

                @JonB said in Download and regex parse an url source code:

                and regular expressions are a hack which at best work "approximately"

                I don't agree with that. In my opinion, regexes are extremely powerful and work as expected if used correctly. I agree they are not designed for parsing code, but combined with another "search algorithm" such as XQuery, or with simple string manipulation, they can achieve almost anything. I have been using regexes for years, and I have never seen an "approximative" result, except, of course, when the regular expression itself was badly defined.
                But that's only my opinion.

                  JonB wrote (#19):

                  @Gojir4
                  I never said regular expressions themselves are "approximative"! Of course they work. But if you do not know, or cannot correctly parse, the input (HTML in this case), then what they recognise can be, and often is, simply faulty. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you/the OP, I don't know.

                  There are plenty of posts on, say, Stack Overflow explaining why HTML cannot be correctly parsed/recognised via regular expressions.
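
The commented-out case is easy to reproduce. The following is a small sketch in standard C++, assuming a simple href-matching pattern; the function name hrefsByRegex and the example URLs are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every double-quoted href value found by a plain regex scan.
// Hypothetical helper: the regex knows nothing about HTML structure, so a
// link inside <!-- ... --> is returned exactly like a live one.
std::vector<std::string> hrefsByRegex(const std::string& html) {
    static const std::regex href(R"rx(href\s*=\s*"([^"]*)")rx",
                                 std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(html.begin(), html.end(), href), end;
         it != end; ++it) {
        found.push_back((*it)[1].str());  // capture group 1: the URL text
    }
    return found;
}
```

Given the input `<a href="http://live.example/">x</a><!-- <a href="http://dead.example/">y</a> -->`, the scan returns both URLs, including the one in the comment. A real HTML parser would hand back a comment node there instead of an anchor.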

                    Mr Gisa wrote (#20):

                    I solved the problem by using the myhtml library; it's fast and did the trick really nicely.

                      JonB wrote (#21):

                      @Mr-Gisa
                      Then you should make that myhtml in your post a link to wherever it is, to help others. Thanks.

                        Mr Gisa wrote (#22):

                        @JonB I was going to do that, but with everything else going on I forgot. Thanks for pointing it out.

                            #include <myhtml/api.h>
                            #include <QByteArray>
                            #include <QDebug>
                            #include <QString>

                            int main()
                            {
                                const QString html = "<html><head></head><body><div><span>HTML</span><a href=\"http://www.google.com\">a</a><a href=\"ohyeah.com\">b</a></div></body></html>";
                                // keep the UTF-8 bytes in a QByteArray so the buffer outlives the parse
                                const QByteArray chtml = html.toUtf8();

                                // basic init
                                myhtml_t* myhtml = myhtml_create();
                                myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

                                // first tree init
                                myhtml_tree_t* tree = myhtml_tree_create();
                                myhtml_tree_init(tree, myhtml);

                                // parse html, passing the byte count explicitly instead of calling strlen
                                myhtml_parse(tree, MyENCODING_UTF_8, chtml.constData(), static_cast<size_t>(chtml.size()));

                                // get the collection of <a> nodes
                                myhtml_collection_t* collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);

                                // guard against a null collection before reading its length
                                for (size_t i = 0; collection && i < collection->length; i++) {
                                    myhtml_tree_attr_t* gets_attr = myhtml_attribute_by_key(collection->list[i], "href", 4);

                                    if (gets_attr) {
                                        const char* attr_char = myhtml_attribute_value(gets_attr, NULL);
                                        qDebug() << attr_char;
                                    }
                                }

                                // release resources
                                myhtml_collection_destroy(collection);
                                myhtml_tree_destroy(tree);
                                myhtml_destroy(myhtml);
                                return 0;
                            }
                        
                          JonB wrote (#23):

                          @Mr-Gisa
                          No, you misunderstand! I want to know: what is this "myhtml" thing? Is it a package? Source code? I want the hyperlink to wherever it is, so that I can look at/download it like you have done!

                            Mr Gisa wrote (#24):

                            MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. https://github.com/lexborisov/myhtml

                              Gojir4 wrote (#25):

                              @JonB OK, I see, sorry for my misunderstanding. I agree with that. My point was that if you know in advance which format you will have to parse (as with Doxygen), regex and XQuery can be a solution. Anyway, the problem has been solved :).

                                Mr Gisa wrote (#26):

                                That is okay, you helped a lot
