Extract parts from webpage

    artwaw wrote (#8):

    I have made a few attempts at parsing some services with the stream parsers. Sometimes it works, sometimes it fails because of unexpected content thrown in.
    Sometimes I found it better to load the page into a QTextDocument and use the QTextBlock/QTextCursor approach.
    I think it really depends on what you expect to extract.

    For more information please re-read.

    Kind Regards,
    Artur

      realroot wrote (#9):

      The data I fetch should always be inside the same "blocks". Another example:

      <div class="infobox-works"><img src="/images/infobox/focus-abs.jpg" alt=""></div>
      

      So I want to extract "/images/infobox/focus-abs.jpg".
      I think a regex can do this, but my attempt didn't work.

      I will look at QDomDocument to see whether it can parse HTML.

        Christian Ehrlicher (Lifetime Qt Champion) wrote (#10):

        Why don't you simply search for <img src= then?

        Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
        Visit the Qt Academy at https://academy.qt.io/catalog

          realroot wrote (#11):

          Search with QRegularExpression? Could you clarify?

            Christian Ehrlicher (Lifetime Qt Champion) wrote (#12):

            @realroot said in Extract parts from webpage:

            Search with QRegularExpression?

            Why do you need a regexp when you want to search for a simple string?

              JonB wrote (#13, last edited by JonB):

              @realroot
              If you don't want to use regular expressions then, as @Christian-Ehrlicher has said, you could search for the literal string <img src=" via indexOf(), find the next " after that, and the filepath is the text between the two quote indexes.
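
A minimal sketch of that indexOf() idea, for illustration only. It is written against std::string so that it compiles without Qt, which is an assumption of this example and not what the thread uses: std::string::find() plays the role of QString::indexOf(), and substr() the role of QString::mid(). The function name extractImgSrc is hypothetical.

```cpp
#include <string>

// Copies the text between `<img src="` and the next `"` into `path`.
// Returns false (leaving `path` untouched) if either marker is missing.
// Plain substring search, no regular expression involved.
bool extractImgSrc(const std::string &html, std::string &path)
{
    const std::string marker = "<img src=\"";

    // Position of the marker, or npos if it does not occur
    // (QString::indexOf() would return -1 in that case).
    const std::string::size_type markerPos = html.find(marker);
    if (markerPos == std::string::npos)
        return false;

    // The path starts right after the opening quote ...
    const std::string::size_type start = markerPos + marker.size();
    // ... and ends just before the next quote.
    const std::string::size_type end = html.find('"', start);
    if (end == std::string::npos)
        return false;

    path = html.substr(start, end - start);
    return true;
}
```

This only finds the first match, and only when the attribute is written exactly as `<img src="`. Extra whitespace, single quotes, or another attribute placed before src would need additional handling.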

              For picking out the filepath in a regular expression you will want something like
              <img src="([^"]*)"
              The parentheses (...) allow you to capture the string inside. You have to escape the quotes as needed to put the pattern in a C++ string literal, or use a raw string literal.

              Even though it does not offer precise Qt syntax, I would recommend experimenting at e.g. https://regex101.com/ (ECMAScript (JavaScript) flavor) with bits of your input to learn how to match.
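
A sketch of the capture-group approach, again for illustration. It uses std::regex (ECMAScript flavor by default) so that it compiles without Qt, which is an assumption of this example. With QRegularExpression the same pattern string should apply, with QRegularExpressionMatch::captured(1) returning the captured text. The function name extractAllImgSrc is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the src value of every <img src="..."> occurrence in `html`.
std::vector<std::string> extractAllImgSrc(const std::string &html)
{
    // Raw string literal, so the quotes in the pattern need no backslashes.
    // The custom delimiter `re` is required here: the pattern itself
    // contains )" which would end a plain R"( ... )" literal too early.
    static const std::regex pattern(R"re(<img src="([^"]*)")re");

    std::vector<std::string> paths;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern); it != end; ++it)
        paths.push_back((*it)[1].str()); // group 1 = text between the quotes
    return paths;
}
```

Unlike a single indexOf() search, iterating over the matches picks up every image on the page in one pass.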

                SGaist (Lifetime Qt Champion) wrote (#14):

                The regular expression tool might be worth building and trying out, to get the correct syntax to use with QRegularExpression.

                Interested in AI ? www.idiap.ch
                Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

                  realroot wrote (#15):

                  I had not thought about indexOf().
                  I can indeed use that.

                  To save a JPG or PDF, can I use a QPixmap?

                  void onFinished(QNetworkReply *reply) {
                      ...
                      // Decode the downloaded bytes as an image.
                      QPixmap pm;
                      pm.loadFromData(reply->readAll());
                  }
                    SGaist (Lifetime Qt Champion) wrote (#16):

                    What does your reply contain? If it's the image data, then write it directly to a file.

                      realroot wrote (#17, last edited by realroot):

                      I still have to try it, but the request should be:

                      QNetworkReply* reply = m_manager->get(QNetworkRequest(QUrl("https://site.com/image.jpg")));
                      

                      So I think the reply is the image data.

                      I use QTextStream for text, but I am not sure how to handle images.

                        SGaist (Lifetime Qt Champion) wrote (#18):

                        So this is binary data; just use QFile to write it to disk directly.
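
A sketch of "write the bytes to disk unchanged", for illustration. It uses std::ofstream so that it compiles without Qt, which is an assumption of this example. In Qt the corresponding calls would be QFile::open(QIODevice::WriteOnly) followed by QFile::write(reply->readAll()). The function name saveBytes is hypothetical.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Writes `bytes` to `path` exactly as received, without decoding them.
// Returns false if the file cannot be opened or the write fails.
bool saveBytes(const std::string &path, const std::vector<char> &bytes)
{
    // Binary mode avoids any newline translation, which would corrupt
    // image or PDF data on platforms that distinguish text files.
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush(); // surface write errors before reporting success
    return out.good();
}
```

Because nothing is decoded, the same function serves for a JPG, a PDF, or any other download. A QPixmap or a text stream is not needed just to save the file.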

                        realroot has marked this topic as solved.
