
Download and regex parse an url source code

  • M Mr Gisa
    11 May 2018, 13:48

    I solved the problem by using the myhtml library. It's fast and did the trick really nicely.

    JonB
    wrote on 11 May 2018, 14:26 last edited by
    #21

    @Mr-Gisa
    Then you should make that myhtml in your post a link to wherever it is, to help others. Thanks.

      Mr Gisa
      wrote on 11 May 2018, 14:31 last edited by
      #22

      @JonB I was going to do that, but with so many things going on I forgot. Thanks for pointing it out.

          // Requires: #include <myhtml/api.h>, <QString>, <QByteArray>, <QDebug>
          QString html = "<html><head></head><body><div><span>HTML</span><a href=\"http://www.google.com\">a</a><a href=\"ohyeah.com\">b</a></div></body></html>";
          // keep the UTF-8 bytes in a QByteArray so the buffer and its size stay valid while parsing
          QByteArray chtml = html.toUtf8();
      
          // basic init
          myhtml_t* myhtml = myhtml_create();
          myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
      
          // first tree init
          myhtml_tree_t* tree = myhtml_tree_create();
          myhtml_tree_init(tree, myhtml);
      
          // parse html, passing the byte count explicitly instead of relying on strlen
          myhtml_parse(tree, MyENCODING_UTF_8, chtml.constData(), static_cast<size_t>(chtml.size()));
      
          // get the collection of <a> nodes
          myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);
      
          // guard against a NULL collection in case the lookup failed
          for (size_t i = 0; collection && i < collection->length; i++) {
              // 4 is the length of the key "href"
              myhtml_tree_attr_t *gets_attr = myhtml_attribute_by_key(collection->list[i], "href", 4);
      
              if (gets_attr) {
                  const char *attr_char = myhtml_attribute_value(gets_attr, NULL);
                  qDebug() << attr_char;
              }
          }
      
          // release resources
          if (collection)
              myhtml_collection_destroy(collection);
          myhtml_tree_destroy(tree);
          myhtml_destroy(myhtml);
      
        JonB
        wrote on 11 May 2018, 14:33 last edited by
        #23

        @Mr-Gisa
        No, you misunderstand! I want to know: what is this "myhtml" thing? Is it a package? Source code? I want the hyperlink to wherever it is, so that I can look at/download it like you have done!

          Mr Gisa
          wrote on 11 May 2018, 14:37 last edited by
          #24

          MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. https://github.com/lexborisov/myhtml

          • J JonB
            11 May 2018, 10:40

            @Gojir4
            I never said regular expressions themselves are "approximative"! Of course they work. But if you do not know the structure of the input and cannot parse it correctly (HTML in this case), then what they recognise can be, and often is, simply wrong. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you/the OP, I don't know.

            There are plenty of posts on, say, stackoverflow explaining why HTML cannot be correctly parsed/recognised via regular expressions.
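            To make the commented-out case concrete, here is a minimal sketch in standard C++ (using `std::regex` rather than Qt's regex classes, so it compiles without Qt). The function name `hrefsByRegex` and the pattern are illustrative assumptions, not code from this thread. The pattern matches every `href="..."` it sees and has no idea whether the match sits inside a comment.

            ```cpp
            #include <regex>
            #include <string>
            #include <vector>

            // Collect every href="..." value the pattern matches.
            // The regex sees only characters, not HTML structure, so it cannot
            // tell a live link from one inside <!-- ... -->.
            std::vector<std::string> hrefsByRegex(const std::string& html) {
                static const std::regex hrefPattern(R"rx(href="([^"]*)")rx");
                std::vector<std::string> found;
                for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
                     it != end; ++it) {
                    found.push_back((*it)[1].str());  // capture group 1 is the URL
                }
                return found;
            }
            ```

            Called on a string containing one live `<a href="...">` and one inside `<!-- ... -->`, this returns both URLs. A real HTML parser would report only the live one.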

            Gojir4
            wrote on 11 May 2018, 16:06 last edited by
            #25

            @JonB OK, I see, sorry for my misunderstanding. I agree with that. My point was that if you know in advance which format you will have to parse (as with Doxygen), regex and XQuery can be a solution. Anyway, the problem has been solved :).
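            As a rough illustration of the known-format case, the sketch below assumes a hypothetical generator that always writes links in exactly the form `<a class="el" href="URL">TEXT</a>`. That format, the function name `linksFromKnownFormat`, and the pattern are assumptions made for the example, not something taken from Doxygen's documentation. Because the pattern is tied to one fixed shape, it ignores anchors written any other way.

            ```cpp
            #include <regex>
            #include <string>
            #include <utility>
            #include <vector>

            // Extract (href, text) pairs from anchors that follow one fixed,
            // known-in-advance shape. Anchors in any other shape are skipped.
            std::vector<std::pair<std::string, std::string>>
            linksFromKnownFormat(const std::string& html) {
                static const std::regex linkPattern(
                    R"rx(<a class="el" href="([^"]*)">([^<]*)</a>)rx");
                std::vector<std::pair<std::string, std::string>> links;
                for (std::sregex_iterator it(html.begin(), html.end(), linkPattern), end;
                     it != end; ++it) {
                    // group 1 is the URL, group 2 is the link text
                    links.emplace_back((*it)[1].str(), (*it)[2].str());
                }
                return links;
            }
            ```

            This only holds up while the generator's output stays in that exact shape. If the attribute order or quoting changes, the pattern silently stops matching.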

              Mr Gisa
              wrote on 11 May 2018, 16:06 last edited by
              #26

              That is okay, you helped a lot.
