Handling very big HTML documents

    UndeadBlow wrote (#1):

    Hi. I need to handle (load and search) very big, ugly HTML files that were exported from Google Docs. Such files are usually extremely bloated and ugly because they are auto-generated, but I still need to load them and search them quickly.
    Do you have any ideas about the fastest way to do that? The most naive way would be to load each file into a QString and search in it, but I'm afraid that at some sizes this becomes too heavy.

    P.S. Just as an example, 6,000 characters of plain text can turn into 40,000 characters in such auto-generated HTML.

      jsulm (Lifetime Qt Champion) wrote (#2):

      @UndeadBlow 40000 characters isn't much actually. Why do you think it will be too big?

        UndeadBlow wrote (#3):

        @jsulm said in Handling very big HTML documents:

        @UndeadBlow 40000 characters isn't much actually. Why do you think it will be too big?

        That is just an example of how much the files grow. I will have maybe hundreds of such documents.

          jsulm (Lifetime Qt Champion) wrote (#4):

          @UndeadBlow Are you going to keep all these documents in memory at the same time, or are you going to handle them one after another?

            UndeadBlow wrote (#5):

            @jsulm said in Handling very big HTML documents:

            @UndeadBlow Are you going to keep all these documents in memory at the same time, or are you going to handle them one after another?

            Most probably I will need to have them all in memory so that I can search across all of them together. That's why I want to either clean or convert those HTML files.
            But it seems that Tidy does not clean such files very well.

              Eeli K wrote (#6):

              Is this a one-time customized solution for a specific problem or a general solution which is deployed on several different systems? You can calculate a worst-case upper limit for the data and compare it with the worst-case limit on resources. Can the resources (memory) actually hold that data?

              For example, 50 000 16-bit characters in one document, 1000 documents. That takes only about 100 MB of memory. Really, which system can't handle that, unless you use an embedded, mobile or credit-card-sized system?

              Another option is to use file handles, but I guess strings may be faster once you have loaded them. If you just have to run through all the files once, file handles may be better. If you have to do several searches, strings may be faster. Depending on your system and needs you could also delegate the task to an existing search tool, for example grep. Such tools are probably already optimized for the general case and can handle large amounts of data.
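
The worst-case estimate above can be written down as a small calculation. This is only a sketch in standard C++; the function name `estimateBytes` and its parameters are made up for illustration, and the default of 2 bytes per character assumes UTF-16 storage as in QString, ignoring any per-string overhead.

```cpp
#include <cstdint>

// Worst-case number of bytes needed to keep every document in memory.
// All three inputs are upper limits you choose yourself; the default of
// 2 bytes per character assumes 16-bit (UTF-16) code units.
constexpr std::uint64_t estimateBytes(std::uint64_t charsPerDocument,
                                      std::uint64_t documentCount,
                                      std::uint64_t bytesPerChar = 2)
{
    return charsPerDocument * documentCount * bytesPerChar;
}

// The figures from the post: 50 000 characters x 1000 documents x 2 bytes.
static_assert(estimateBytes(50000, 1000) == 100000000, "about 100 MB");
```

Comparing the result with the memory actually available on the target system answers the "can the resources hold that data" question before any code is written.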

                p3c0 (Moderators) wrote (#7):

                Since these are just plain files, how about opening them with QFile, reading them line by line and searching each line for the data? That way it won't be necessary to load the whole data at once.

                  UndeadBlow wrote (#8):

                  @Eeli-K said in Handling very big HTML documents:

                  Is this a one-time customized solution for a specific problem or a general solution which is deployed on several different systems? You can calculate a worst-case upper limit for the data and compare it with the worst-case limit on resources. Can the resources (memory) actually hold that data?

                  For example, 50 000 16-bit characters in one document, 1000 documents. That takes only about 100 MB of memory. Really, which system can't handle that, unless you use an embedded, mobile or credit-card-sized system?

                  Another option is to use file handles, but I guess strings may be faster once you have loaded them. If you just have to run through all the files once, file handles may be better. If you have to do several searches, strings may be faster. Depending on your system and needs you could also delegate the task to an existing search tool, for example grep. Such tools are probably already optimized for the general case and can handle large amounts of data.

                  I worry more about time; memory is most probably not a big problem.
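                  Well, a plain substring search just scans the text, so its cost grows roughly linearly with the size (it would be log N only with an index); maybe I should not have worried. I just had a feeling that storing 100 MB in one string is not OK.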

                    mrjj (Lifetime Qt Champion) wrote (#9):

                    Hi
                    Can you give an example of what you search for?
                    Do you need any of the HTML structure when a word is found, or
                    is it simply a matter of finding "love", with the HTML not mattering at all?
                    Also, when it is found, then what?

                    Do you need to know the position, or is the only important information whether the word was found?

                      UndeadBlow wrote (#10):

                      @mrjj Yes, I need the HTML structure, and I need the position so that I can then copy the place that was found; otherwise it would of course be fine to use the plain-text export from Google Docs. It seems that I can just use QString.
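
For the "keep everything in memory and remember where each hit is" approach, a minimal sketch could look like the following. It uses std::string so that it compiles without Qt; with QString the inner loop would call QString::indexOf() in the same way. The names `Hit`, `findInAll` and `excerpt` are invented for this example, and the excerpt is cut by character count only, so it may start or end in the middle of a tag.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Hit {
    std::size_t document; // index into the vector of loaded documents
    std::size_t offset;   // character offset of the match in that document
};

// Scans every loaded document and records where the needle occurs.
std::vector<Hit> findInAll(const std::vector<std::string>& documents,
                           const std::string& needle)
{
    std::vector<Hit> hits;
    if (needle.empty())
        return hits;
    for (std::size_t d = 0; d < documents.size(); ++d) {
        const std::string& text = documents[d];
        // Non-overlapping matches: continue after the end of each hit.
        for (std::size_t pos = text.find(needle); pos != std::string::npos;
             pos = text.find(needle, pos + needle.size())) {
            hits.push_back({d, pos});
        }
    }
    return hits;
}

// Copies the match plus up to `context` characters on each side.
// `offset` is expected to come from findInAll for the same text.
std::string excerpt(const std::string& text, std::size_t offset,
                    std::size_t length, std::size_t context)
{
    const std::size_t begin = offset > context ? offset - context : 0;
    const std::size_t end = std::min(text.size(), offset + length + context);
    return text.substr(begin, end - begin);
}
```

Each search is a linear pass over all loaded text, so it is worth timing on real exports before adding anything more elaborate such as an index.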

                        mrjj (Lifetime Qt Champion) wrote (#11):

                        @UndeadBlow
                        OK, so the HTML does matter.
                        Yes, a 100 MB string on a desktop-class machine is not really heavy, but
                        if you have many files then it will take some time to go over them all :)
