Extract parts from webpage

    realroot
    wrote on last edited by
    #1

    I can read the webpage as html:

    if (reply->error() == QNetworkReply::NoError) {
       QString htmlContent = reply->readAll();
    }
    

    I'd like to download some images and text from the HTML:

    
    <!DOCTYPE html>
    <html lang="en-gb">
    <head>
    [...]
    <div class="infobox">
    <div class="infobox-map"><img src="/images/things/map/code-of-abs.jpg" alt="SITE Things: What it Works"></div>
    <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
    <div class="infobox-focus"><img src="/images/infobox/type-abs.jpg" alt=""></div>
    <div class="infobox-difficulty"><img src="/images/infobox/difficulty-5.jpg" alt=""></div>
    <div class="infomore">
    <div class="infotext">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at orci nibh. Phasellus quis risus rhoncus, pellentesque elit vel, egestas ligula. Ut vehicula tellus quis sem consequat vehicula. Ut convallis eget odio sed mollis. Ut faucibus felis sed laoreet mollis. Nam lobortis cursus dignissim .</div>
    <div class="infoec"><strong>Extra Credit:</strong> 30 seconds rest between sets. </div>
    </div>
    

    I searched for a way to do this but did not find anything.
    To start, I tried to extract the infotext section with:

    const QRegularExpression MyClass::descrRegex("<div class=\\\"infotext\\\">(.*?)</div>", QRegularExpression::DotMatchesEverythingOption | QRegularExpression::MultilineOption);
    

    But I do not get any result.
    I do not know much about regex, so I may have done something wrong.
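One way to check such a pattern in isolation is to run it against the sample HTML outside the network code. The sketch below does that with the standard library only, so it compiles without Qt; the function name `extractInfoText` is hypothetical. `std::regex` has no equivalent of `QRegularExpression::DotMatchesEverythingOption`, so `[\s\S]*?` stands in for "any character, including newlines, non-greedy". In Qt the analogous step would be `QRegularExpression::match()` followed by `captured(1)`.

```cpp
#include <regex>
#include <string>

// Returns the text inside the first <div class="infotext">...</div>,
// or an empty string if no such block is found.
std::string extractInfoText(const std::string &html)
{
    // Raw string literal: the quotes need no backslash escaping.
    // [\s\S]*? matches any characters (newlines included), as few as possible.
    static const std::regex re(R"re(<div class="infotext">([\s\S]*?)</div>)re");

    std::smatch m;
    if (std::regex_search(html, m, re))
        return m[1].str();   // capture group 1: the text between the tags
    return std::string();
}
```

If this matches the sample but the Qt code still yields nothing, the problem may lie outside the pattern, for example in how the match result is read.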

        ankou29666
        wrote on last edited by
        #2

        Hi

        My bet would be parsing the document with QXmlStreamReader and QXmlStreamAttributes. If you are using Qt6, note that the classes are completely different between 5 and 6, so when browsing the documentation, check that you are viewing the classes for Qt6 and not the Qt5 compatibility module.

        I'm not absolutely certain this is going to work with Qt, but in browser/JS I've always seen HTML parsing done with XML parsers, HTML being more or less supposed to be XML, or XML-compliant at least.

        You would extract the elements, for example <img ...> tags, and you should find classes with methods to extract the value for any attribute you like.

          JonB
          wrote on last edited by JonB
          #3

          @ankou29666
          I would be "surprised" if a (decent) XML parser could parse HTML. I have not seen HTML parsing done with XML parsers; HTML is not supposed to be XML, and it is not "XML compliant". IIRC there is a standard called XHTML which is or claims to be XML-compliant, but I don't recall how I fared with it years ago, and the source has to be written as XHTML.

          Most "real" HTML pages are sprinkled with JS blocks and similar and I don't see how XML would deal with that. HTML commonly uses standalone constructs like <br> or <hr> without any closing tag. And I believe you can have/editors can produce <b><i> ... </b></i> which is accepted in HTML. Posts like https://stackoverflow.com/questions/32572928/parsing-an-html-document-using-an-xml-parser discuss this and mention that you really need a third-party library to create a web-crawler application. There may be more up-to-date answers. I should be surprised if the Qt XML parser copes with HTML.

          Of course I may be wrong and it does no harm to try, but don't hold your breath :)

          If all you want is, say, to recognise <img src="..."> tags and extract a path from it AND you are fault-tolerant --- you don't mind missing some images and you don't mind picking up something as an image which isn't or is commented out or whatever --- then you may get a limited way with regular expressions parsing.

            ankou29666
            wrote on last edited by
            #4

            I don't remember where I've seen this, and I'm even wondering whether I might have done this myself in the past. But yeah, I think I'm mixing up HTML and XHTML; I'm probably talking about the latter.
            And yeah, you're right about the standalone tags that don't have a closing tag in HTML.

            What are you talking about with "JS blocks and similar" ? <script></script> ? I see no problem with this.

              JonB
              wrote on last edited by JonB
              #5

              @ankou29666 said in Extract parts from webpage:

              What are you talking about with "JS blocks and similar" ? <script></script> ? I see no problem with this.

              :) In between <script> and </script> you can have essentially anything over many lines. Probably anything you like except </script>. An HTML parser must know to do something like allow anything inside <script>. Where does XML have a tag like <script> which allows absolutely any arbitrary non-XML/HTML (not to mention what would it do to put that into an XML tree), and just pick up again at </script>? :)

              Btw, XML has <![CDATA[ ... ]]> for arbitrary text insert, but that's very different syntax from <script> ... </script>.

                ankou29666
                wrote on last edited by
                #6

                XML has no predefined tags, so what's wrong in having one named <script> rather than <foo> or <bar> ?
                Yes, the <script> tag can have content over multiple lines, and what's the problem with that? Now I remember a bit more about what I did a few years ago: I had no problem extracting the content of a <script> tag (among others) from the browser's parser.

                Well OK, a little search shows that it was JS's DOMParser I was using, which actually handles both HTML and XML.
                I had forgotten the subtle difference between DOM and XML. Thus my mistake.

                Qt had QDomDocument class (which looks much more like what I was initially searching for) but still intended for XML and not HTML ...

                  JonB
                  wrote on last edited by JonB
                  #7

                  @ankou29666 said in Extract parts from webpage:

                  Yeah. and what's the problem with this ?

                  < and & characters, at least, in the <script>/JS area, e.g.

                  if (a & b < c)
                      document.write("if (a & b < c)");
                  

                  Use of CDATA in program output.

                    artwaw
                    wrote on last edited by
                    #8

                    I have made a few attempts at parsing some services with the stream parsers; sometimes it works, sometimes it fails due to unexpected content thrown in.
                    Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
                    I think it really depends what you expect to extract.

                    For more information please re-read.

                    Kind Regards,
                    Artur

                      realroot
                      wrote on last edited by
                      #9

                      The data I fetch should be always inside the same "blocks", another example:

                      <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
                      

                      So I want "/images/infobox/focus-abs.jpg".
                      I think that regex can do this, but it didn't work.

                      I will look at QDomDocument to see whether it can parse HTML.

                        Christian Ehrlicher
                        Lifetime Qt Champion
                        wrote on last edited by
                        #10

                        Why don't you simply search for <img src= then?

                        Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                        Visit the Qt Academy at https://academy.qt.io/catalog

                          realroot
                          wrote on last edited by
                          #11

                          Search with QRegularExpression? Could you clarify?

                            Christian Ehrlicher
                            Lifetime Qt Champion
                            wrote on last edited by
                            #12

                            @realroot said in Extract parts from webpage:

                            Search with QRegularExpression?

                            Why do you need a regexp when you want to search for a simple string?

                              JonB
                              wrote on last edited by JonB
                              #13

                              @realroot
                              If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for literal string <img src=" via indexOf(), find the next " after that, and the filepath is in-between the quote indexes.

                              For picking out the filepath in a regular expression you will want something like
                              <img src="([^"]*)"
                              The parentheses (...) allow you to capture the string inside. You have to escape the quotes appropriately inside a C++ string literal, or use a raw string literal.

                              Even though it does not offer precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (EcmaScript (JavaScript) flavor) with bits of your input to learn how to match.
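Both suggestions can be sketched side by side. The version below uses only the standard library so that it runs without Qt; the function names `imgSourcesByFind` and `imgSourcesByRegex` are hypothetical. `std::string::find` plays the role of `QString::indexOf()` here, and `std::sregex_iterator` plays the role of `QRegularExpression::globalMatch()`. The raw string uses the custom delimiter `re` because the pattern itself contains `)"`, which would otherwise end the literal early.

```cpp
#include <regex>
#include <string>
#include <vector>

// Approach 1: plain substring search, no regular expressions.
std::vector<std::string> imgSourcesByFind(const std::string &html)
{
    static const std::string marker = "<img src=\"";
    std::vector<std::string> result;
    std::size_t pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        const std::size_t start = pos + marker.size(); // first char after the opening quote
        const std::size_t end = html.find('"', start); // position of the closing quote
        if (end == std::string::npos)
            break;                                     // unterminated attribute: stop
        result.push_back(html.substr(start, end - start));
        pos = end + 1;                                 // continue after this attribute
    }
    return result;
}

// Approach 2: the pattern from the post, with one capture group for the path.
std::vector<std::string> imgSourcesByRegex(const std::string &html)
{
    static const std::regex re(R"re(<img src="([^"]*)")re");
    std::vector<std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), re), last; it != last; ++it)
        result.push_back((*it)[1].str());              // group 1: text between the quotes
    return result;
}
```

Both functions only recognise tags written exactly as `<img src="...">` with `src` as the first attribute, which is the fault-tolerant, limited matching discussed earlier in the thread.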

                                SGaist
                                Lifetime Qt Champion
                                wrote on last edited by
                                #14

                                The regular expression tool might be worth building and testing with, to get the correct syntax to use with QRegularExpression.

                                Interested in AI ? www.idiap.ch
                                Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

                                  realroot
                                  wrote on last edited by
                                  #15

                                  I did not think about indexOf().
                                  I can use that indeed.

                                  To save a jpg or pdf, can I use a QPixmap?

                                  void onFinished(QNetworkReply *reply) {
                                      ...
                                      QPixmap pm;
                                      pm.loadFromData(reply->readAll());
                                  }
                                  
                                    SGaist
                                    Lifetime Qt Champion
                                    wrote on last edited by
                                    #16

                                    What does your reply contain ? If it's the image data, then write it directly to a file.

                                      realroot
                                      wrote on last edited by realroot
                                      #17

                                      I still have to try it, but the request should be:

                                      QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));
                                      

                                      So I think it is the image data.

                                      I use QTextStream for text, but I am not sure how to handle images.

                                        SGaist
                                        Lifetime Qt Champion
                                        wrote on last edited by
                                        #18

                                        So these are binary data; just use QFile to write them to disk directly.
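The point is that the bytes go to disk unchanged, with no text encoding or image decoding step in between. Here is a minimal sketch of that idea using only the standard library, so that it runs without Qt; the function name `saveBytes` and the `std::vector<char>` input are assumptions for illustration. In a Qt application the same step would be opening a `QFile` with `QIODevice::WriteOnly` and passing the `QByteArray` from `reply->readAll()` to `QFile::write()`.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Writes the bytes to 'path' exactly as given.
// Returns false if the file could not be opened or the write failed.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    // Binary mode: no newline translation or other text processing.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out.flush());
}
```

Because nothing interprets the data, the same function works for a jpg, a pdf, or any other downloaded file.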

                                        • R realroot has marked this topic as solved on
