Extract parts from webpage

18 Posts 6 Posters 1.2k Views 3 Watching
    ankou29666 wrote (#4):

    I don't remember where I saw this, and I even wonder whether I did it myself in the past. But yes, I think I'm mixing up HTML and XHTML; I'm probably talking about the latter.
    And you're right about the standalone tags that have no closing tag in HTML.

    What do you mean by "JS blocks and similar"? <script></script>? I see no problem with this.

      JonB wrote (#5):

      @ankou29666 said in Extract parts from webpage:

      What are you talking about with "JS blocks and similar" ? <script></script> ? I see no problem with this.

      :) Between <script> and </script> you can have essentially anything, over many lines; probably anything you like except </script>. An HTML parser has to know to allow arbitrary content inside <script>. Where does XML have a tag like <script> that allows arbitrary non-XML/HTML content (not to mention what it would mean to put that into an XML tree) and simply picks up again at </script>? :)

      Btw, XML has <![CDATA[ ... ]]> for inserting arbitrary text, but that is a very different syntax from <script> ... </script>.

        ankou29666 wrote (#6):

        XML has no predefined tags, so what's wrong with having one named <script> rather than <foo> or <bar>?
        Yes, the <script> tag can have content spanning multiple lines. What is the problem with that? Now that I remember a bit more of what I did a few years ago: I had no problem extracting the content of a <script> tag (among others) with the browser's parser.

        OK, a quick search shows that it was JS's DOMParser I was using, which handles both HTML and XML.
        I had forgotten the distinction between DOM and XML, hence my mistake.

        Qt has the QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML, not HTML ...

          JonB wrote (#7):

          @ankou29666 said in Extract parts from webpage:

          Yeah. and what's the problem with this ?

          The < and & characters, at least, inside the <script>/JS area. An XML parser treats < as the start of a tag and & as the start of an entity reference, e.g.

          if (a & b < c)
              document.write("if (a & b < c)");
          

          Use of CDATA in program output.
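To make the point about < and & concrete, here is a minimal sketch in plain standard C++ (no Qt involved). It only locates the first character in a script body that an XML parser would read as markup. The helper name `firstXmlSpecial` is made up for illustration and is not part of any library.

```cpp
#include <string>

// Hypothetical helper: returns the position of the first character that an
// XML parser would interpret as markup ('<' opens a tag, '&' opens an
// entity reference), or std::string::npos if the text contains neither.
std::string::size_type firstXmlSpecial(const std::string &text)
{
    return text.find_first_of("<&");
}
```

For the JS line above it finds the & in `a & b`, which is where a strict XML parser would typically report an error unless the script is wrapped in a CDATA section.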

            artwaw wrote (#8):

            I have made a few attempts at parsing some services with the stream parsers; sometimes it works, sometimes it fails because of unexpected content thrown in.
            Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
            I think it really depends on what you expect to extract.

            For more information please re-read.

            Kind Regards,
            Artur

              realroot wrote (#9):

              The data I fetch should always be inside the same "blocks". Another example:

              <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
              

              So I want "/images/infobox/focus-abs.jpg".
              I think a regex can do this, but mine didn't work.

              I will look at QDomDocument to see whether it can parse HTML.

                Christian Ehrlicher (Lifetime Qt Champion) wrote (#10):

                Why don't you simply search for <img src= then?

                Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                Visit the Qt Academy at https://academy.qt.io/catalog

                  realroot wrote (#11):

                  Search with QRegularExpression? Could you clarify?

                    Christian Ehrlicher (Lifetime Qt Champion) wrote (#12):

                    @realroot said in Extract parts from webpage:

                    Search with QRegularExpression?

                    Why do you need a regexp when you want to search for a simple string?


                      JonB wrote (#13):

                      @realroot
                      If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for the literal string <img src=" via indexOf(), then find the next " after it; the file path is the text between those two quotes.
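A minimal sketch of that literal-string search, written with `std::string` so it compiles without Qt. In Qt code `QString::indexOf()` and `QString::mid()` would play the roles of `find()` and `substr()` here. The function name `firstImgSrc` is hypothetical, and the sketch assumes the attribute is written exactly as `<img src="` with double quotes.

```cpp
#include <string>

// Hypothetical helper: copies the text between the quotes that follow the
// first occurrence of `<img src="` into out and returns true.
// Returns false if the marker or the closing quote is missing.
bool firstImgSrc(const std::string &html, std::string &out)
{
    const std::string marker = "<img src=\"";
    const std::string::size_type start = html.find(marker);
    if (start == std::string::npos)
        return false;

    const std::string::size_type valueBegin = start + marker.size();
    const std::string::size_type valueEnd = html.find('"', valueBegin);
    if (valueEnd == std::string::npos)
        return false; // opening quote without a closing one

    out = html.substr(valueBegin, valueEnd - valueBegin);
    return true;
}
```

This only finds the first image and breaks if the page changes attribute order or quoting, which is acceptable when the data always sits in the same "blocks".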

                      To pick out the file path with a regular expression you will want something like
                      <img src="([^"]*)"
                      The parentheses (...) capture the string inside. In a C++ string literal you have to escape the quotes (and any backslashes), or use a raw string literal.
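A sketch of that pattern in a raw string literal, using `std::regex` (ECMAScript grammar by default) so it runs without Qt. With Qt, the same pattern text could be handed to `QRegularExpression` and the group read with `QRegularExpressionMatch::captured(1)`. The function name `firstImgSrcRegex` is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: stores capture group 1 of <img src="([^"]*)" in out
// and returns true, or returns false if the pattern does not match.
bool firstImgSrcRegex(const std::string &html, std::string &out)
{
    // The pattern itself contains the sequence )" so the raw string needs a
    // custom delimiter (rx here); plain R"(...)" would end too early.
    static const std::regex pattern(R"rx(<img src="([^"]*)")rx");

    std::smatch match;
    if (!std::regex_search(html, match, pattern))
        return false;

    out = match[1].str();
    return true;
}
```

The raw string keeps the pattern identical to what you would type into a regex tester, with no backslashes added for the quotes.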

                      Even though it does not offer the precise Qt syntax, I would recommend playing at e.g. https://regex101.com/ (ECMAScript (JavaScript) flavor) with bits of your input to learn how to match.

                        SGaist (Lifetime Qt Champion) wrote (#14):

                        It might be worth building and trying the regular expression tool to get the correct syntax to use with QRegularExpression.

                        Interested in AI ? www.idiap.ch
                        Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

                          realroot wrote (#15):

                          I had not thought of indexOf().
                          I can indeed use that.

                          To save a JPG or PDF, can I use a QPixmap?

                          void onFinished(QNetworkReply *reply) {
                              ...
                              QPixmap pm;
                              // Decodes the reply body as an image; returns false if decoding fails.
                              pm.loadFromData(reply->readAll());
                          }
                            SGaist (Lifetime Qt Champion) wrote (#16):

                            What does your reply contain? If it's the image data, then write it directly to a file.

                              realroot wrote (#17):

                              I still have to try it, but it should be:

                              QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));
                              

                              So I think it is the image data.

                              I use QTextStream for text, but I'm not sure how to handle images.

                                SGaist (Lifetime Qt Champion) wrote (#18):

                                So this is binary data; just use QFile to write it to disk directly.
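A minimal sketch of writing bytes to disk unchanged, using `std::ofstream` so it compiles without Qt. In Qt the counterpart would be a `QFile` opened with `QIODevice::WriteOnly`, followed by `write(reply->readAll())`. The helper name `saveBytes` and the use of `std::vector<char>` for the payload are assumptions made for this example.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes bytes to path exactly as given.
// Returns false if the file could not be opened or the write failed.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    // Binary mode avoids newline translation on platforms that perform it.
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file)
        return false;

    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return file.good();
}
```

No image or text class sits in between, so a JPG or PDF reaches the disk byte for byte and is never decoded or re-encoded.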

                                • realroot has marked this topic as solved.
