Extract parts from webpage

18 Posts 6 Posters 1.6k Views 3 Watching
    ankou29666
    wrote on last edited by
    #6

    XML has no predefined tags, so what's wrong with having one named <script> rather than <foo> or <bar>?
    Yes, the <script> tag can have content spanning multiple lines, but what's the problem with that? Now that I remember a bit more about what I did a few years ago, I had no problem extracting the content of a <script> tag (among others) using the browser's parser.

    Well OK, a little searching shows that it was JS's DOMParser I was using, which actually handles both HTML and XML.
    I had forgotten the subtle difference between DOM and XML. Hence my mistake.

    Qt has the QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML and not HTML...

      JonB
      wrote on last edited by JonB
      #7

      @ankou29666 said in Extract parts from webpage:

      Yeah. and what's the problem with this ?

      The < and & characters, at least, in the <script>/JS area, which XML does not allow unescaped, e.g.

      if (a & b < c)
          document.write("if (a & b < c)");
      

      Also, the use of CDATA in program output.

        artwaw
        wrote on last edited by
        #8

        I have made a few attempts at parsing some services with the stream parsers; sometimes it works, sometimes it fails due to unexpected content thrown in.
        Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
        I think it really depends on what you expect to extract.

        For more information please re-read.

        Kind Regards,
        Artur

          realroot
          wrote on last edited by
          #9

          The data I fetch should always be inside the same "blocks". Another example:

          <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
          

          So I want "/images/infobox/focus-abs.jpg".
          I think a regex can do this, but my attempt didn't work.

          I will look at QDomDocument to see if it can parse HTML.

            Christian Ehrlicher
            Lifetime Qt Champion
            wrote on last edited by
            #10

            Why don't you simply search for <img src= then?

            Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
            Visit the Qt Academy at https://academy.qt.io/catalog

              realroot
              wrote on last edited by
              #11

              Search with QRegularExpression? Could you clarify?

                Christian Ehrlicher
                Lifetime Qt Champion
                wrote on last edited by
                #12

                @realroot said in Extract parts from webpage:

                Search with QRegularExpression?

                Why do you need a regexp when you want to search for a simple string?

                  JonB
                  wrote on last edited by JonB
                  #13

                  @realroot
                  If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for literal string <img src=" via indexOf(), find the next " after that, and the filepath is in-between the quote indexes.
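A minimal sketch of the plain string search described above, written with std::string so that it compiles without Qt. The function name extractImgSrc is hypothetical; with QString, indexOf() and mid() would play the roles that find() and substr() play here.

```cpp
#include <string>

// Copies the text between `<img src="` and the next double quote into `path`.
// Returns false if the marker or the closing quote is not found.
bool extractImgSrc(const std::string &html, std::string &path)
{
    const std::string marker = "<img src=\"";

    // Locate the start of the literal marker.
    const std::string::size_type start = html.find(marker);
    if (start == std::string::npos)
        return false;

    // The value begins right after the marker and ends at the next quote.
    const std::string::size_type valueBegin = start + marker.size();
    const std::string::size_type valueEnd = html.find('"', valueBegin);
    if (valueEnd == std::string::npos)
        return false;

    path = html.substr(valueBegin, valueEnd - valueBegin);
    return true;
}
```

This only finds the first match and assumes the attribute is written exactly as `<img src="`; different spacing, single quotes, or another attribute before src would need a looser search.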

                  For picking out the filepath in a regular expression you will want something like
                  <img src="([^"]*)"
                  The parentheses (...) allow you to capture the string inside. In a C++ string literal you have to escape the double quotes (and any backslashes), or use a raw string literal instead.
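As a rough illustration of the capture group and the raw string literal, here is a sketch using std::regex from the standard library rather than QRegularExpression, so that it builds without Qt. The function name matchImgSrc is hypothetical. The same pattern text should carry over to QRegularExpression, but that is an assumption to verify against your own input.

```cpp
#include <regex>
#include <string>

// Stores the first captured src value in `path`; returns false if nothing matches.
bool matchImgSrc(const std::string &html, std::string &path)
{
    // Raw string literal with the custom delimiter "re": the pattern itself
    // contains the sequence )" which would end a plain R"( ... )" literal early.
    static const std::regex pattern(R"re(<img src="([^"]*)")re");

    std::smatch match;
    if (!std::regex_search(html, match, pattern))
        return false;

    path = match[1].str(); // group 1 is the text between the quotes
    return true;
}
```

Without the raw literal the same pattern would be written as "<img src=\"([^\"]*)\"", which is harder to read and easier to get wrong.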

                  Even though it does not offer precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (EcmaScript (JavaScript) flavor) with bits of your input to learn how to match.

                    SGaist
                    Lifetime Qt Champion
                    wrote on last edited by
                    #14

                    The regular expression tool might be worth building and testing to get the correct syntax to use with QRegularExpression.

                    Interested in AI ? www.idiap.ch
                    Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

                      realroot
                      wrote on last edited by
                      #15

                      I did not think about indexOf().
                      I can use that indeed.

                      To save a JPG or PDF, can I use a QPixmap?

                      void onFinished(QNetworkReply *reply) {
                          ...
                          QPixmap pm;
                          pm.loadFromData(reply->readAll());
                      
                        SGaist
                        Lifetime Qt Champion
                        wrote on last edited by
                        #16

                        What does your reply contain? If it's the image data, then write it directly to a file.

                          realroot
                          wrote on last edited by realroot
                          #17

                          I still have to try it, but it should be:

                          QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));
                          

                          So I think it is the image data.

                          I use QTextStream for text, but I am not sure how to handle images.

                            SGaist
                            Lifetime Qt Champion
                            wrote on last edited by
                            #18

                            So this is binary data; just use QFile to write it to disk directly.
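To illustrate what "write it to disk directly" means, here is a sketch using std::ofstream in binary mode instead of QFile, so that it compiles without Qt. The function name saveBytes is hypothetical. In Qt the corresponding steps would presumably be opening a QFile with QIODevice::WriteOnly and passing reply->readAll() to write(), with no QPixmap or QTextStream involved.

```cpp
#include <fstream>
#include <string>

// Writes `bytes` unchanged to `path`, replacing any existing file.
// Returns false if the file cannot be opened or the write fails.
bool saveBytes(const std::string &path, const std::string &bytes)
{
    // Binary mode prevents any newline translation of the raw data.
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return out.good();
}
```

The bytes are stored exactly as received, so the same routine works for a JPG, a PDF, or any other download; no decoding into an image object is needed just to save the file.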

                            • R realroot has marked this topic as solved on
