Extract parts from webpage

  • A ankou29666

    XML has no predefined tags, so what's wrong with having one named <script> rather than <foo> or <bar>?
    Yeah, the <script> tag can have content over multiple lines. Yeah. And what's the problem with this? Now that I remember a bit more about what I did a few years ago, I had no problem extracting the content of a <script> tag (among others) with the browser's parser.

    Well, OK, a quick search shows that it was JS's DOMParser I was using, which actually handles both HTML and XML.
    I had forgotten the subtle difference between DOM and XML, hence my mistake.

    Qt has the QDomDocument class (which looks much more like what I was initially searching for), but it is still intended for XML and not HTML...

    JonB
    wrote on last edited by JonB
    #7

    @ankou29666 said in Extract parts from webpage:

    Yeah. And what's the problem with this?

    < and & characters, at least, in the <script>/JS area, e.g.

    if (a & b < c)
        document.write("if (a & b < c)");
    

    Use of CDATA in program output.
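To make the point above concrete, here is a minimal sketch in standard C++ (not Qt) of what the JavaScript snippet would have to look like before a strict XML parser accepted it as text content. The helper name `escapeXmlText` is hypothetical and only illustrates the rule that raw `&` and `<` are not allowed in XML text; real-world HTML pages do not escape their scripts this way, which is why an XML parser tends to trip over them.

```cpp
#include <string>

// Hypothetical helper: escape the two characters that may never appear
// raw inside XML text content ('&' and '<').
std::string escapeXmlText(const std::string &text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break; // starts an entity reference in XML
        case '<': out += "&lt;";  break; // starts a tag in XML
        default:  out += c;       break;
        }
    }
    return out;
}
```

Calling `escapeXmlText("if (a & b < c)")` yields `if (a &amp; b &lt; c)`. Wrapping the script body in a CDATA section is the other way XML allows such characters.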

      artwaw
      wrote on last edited by
      #8

      I have tried parsing some services with the stream parsers a few times; sometimes it works, sometimes it fails due to unexpected content thrown in.
      Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
      I think it really depends on what you expect to extract.

      For more information please re-read.

      Kind Regards,
      Artur

        realroot
        wrote on last edited by
        #9

        The data I fetch should always be inside the same "blocks". Another example:

        <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
        

        So I want to extract "/images/infobox/focus-abs.jpg".
        I think a regex can do this, but my attempt didn't work.

        I will look at QDomDocument to see whether it can parse HTML.

          Christian Ehrlicher
          Lifetime Qt Champion
          wrote on last edited by
          #10

          Why don't you simply search for <img src= then?

          Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
          Visit the Qt Academy at https://academy.qt.io/catalog

            realroot
            wrote on last edited by
            #11

            Search with QRegularExpression? Could you clarify?

              Christian Ehrlicher
              Lifetime Qt Champion
              wrote on last edited by
              #12

              @realroot said in Extract parts from webpage:

              Search with QRegularExpression?

              Why do you need a regexp when you want to search for a simple string?

                JonB
                wrote on last edited by JonB
                #13

                @realroot
                If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for the literal string <img src=" via indexOf(), find the next " after that, and the file path is between the two quote indexes.

                For picking out the file path with a regular expression you will want something like
                <img src="([^"]*)"
                The parentheses (...) capture the string inside. In a C++ string literal you have to escape the double quotes (and any backslashes), or use a raw string literal.

                Even though it does not offer exact Qt syntax, I would recommend playing at e.g. https://regex101.com/ (ECMAScript (JavaScript) flavor) with bits of your input to learn how to match.
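Both approaches described above can be sketched as follows. This is a minimal illustration in standard C++ rather than Qt, so it can be compiled on its own: `std::string::find` and `substr` stand in for `QString::indexOf()` and `QString::mid()`, and `std::regex` stands in for `QRegularExpression` (whose match exposes the group via `captured(1)`). The function names are hypothetical, and both assume the attribute is written exactly as `<img src="...">`, as in the example markup earlier in the thread.

```cpp
#include <regex>
#include <string>

// Plain substring search: locate the marker, then the next double quote.
// Returns an empty string when no match is found.
std::string imgSrcByIndex(const std::string &html)
{
    const std::string marker = "<img src=\"";
    const std::size_t start = html.find(marker);
    if (start == std::string::npos)
        return std::string();
    const std::size_t from = start + marker.size(); // first char of the path
    const std::size_t end = html.find('"', from);   // closing quote
    if (end == std::string::npos)
        return std::string();
    return html.substr(from, end - from);
}

// Regular expression with one capture group for the path.
// The custom raw-string delimiter "rx" is needed because the pattern
// itself contains the sequence )" which would end a plain R"(...)".
std::string imgSrcByRegex(const std::string &html)
{
    static const std::regex re(R"rx(<img src="([^"]*)")rx");
    std::smatch match;
    if (!std::regex_search(html, match, re))
        return std::string();
    return match[1].str(); // group 1 = text between the quotes
}
```

For the `<div class="infobox-works">` line quoted earlier, both functions return `/images/infobox/focus-abs.jpg`. Neither copes with extra attributes before `src`, single quotes, or different spacing, so treat them as a starting point rather than a general HTML parser.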

                  SGaist
                  Lifetime Qt Champion
                  wrote on last edited by
                  #14

                  The regular expression tool might be worth building and testing to get the correct syntax to use with QRegularExpression.

                  Interested in AI ? www.idiap.ch
                  Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

                    realroot
                    wrote on last edited by
                    #15

                    I did not think about indexOf().
                    I can indeed use that.

                    To save a JPG or PDF, can I use a QPixmap?

                    void onFinished(QNetworkReply *reply) {
                        ...
                        QPixmap pm;
                        pm.loadFromData(reply->readAll());
                    
                      SGaist
                      Lifetime Qt Champion
                      wrote on last edited by
                      #16

                      What does your reply contain? If it's the image data, then write it directly to a file.

                        realroot
                        wrote on last edited by realroot
                        #17

                        I still have to try it, but it should be:

                        QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));
                        

                        So I think it is the image data.

                        I use QTextStream for text; I am not sure how to handle images.

                          SGaist
                          Lifetime Qt Champion
                          wrote on last edited by
                          #18

                          So this is binary data; just use QFile to write it to disk directly.
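The advice above amounts to copying the downloaded bytes to disk unchanged, without decoding them as an image or passing them through a text stream. A minimal sketch in standard C++ follows; it is an assumption-laden stand-in, not the Qt code itself: `std::ofstream` in binary mode plays the role of a `QFile` opened with `QIODevice::WriteOnly`, the `std::vector<char>` plays the role of the `QByteArray` returned by `reply->readAll()`, and the name `saveBytes` is hypothetical.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write raw bytes to a file exactly as received.
// Binary mode prevents any newline translation, so JPG or PDF data
// reaches the disk byte for byte.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out)
        return false; // could not open the file for writing
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();
    return !out.fail(); // false if the write or the close failed
}
```

The same shape applies in Qt: open the file for writing, write the byte array once, and check the result. No QPixmap or QTextStream is involved, because the payload is never interpreted.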

                          • R realroot has marked this topic as solved on
