Handling very big HTML documents

UndeadBlow

Hi. I need to handle (load and search) very big, ugly HTML files that were exported from Google Docs. Such HTML files are usually very bloated because they are auto-generated, but I still need to handle them and search them quickly.
Do you have any ideas what the fastest way to do that would be? For example, the most naive way would be to load the file into a QString and search in it, but I'm afraid that at some sizes it can become too heavy.

P.S. Just as an example, 6 000 characters of plain text can turn into 40 000 characters in such auto-generated HTML.

jsulm

@UndeadBlow 40000 characters isn't much actually. Why do you think it will be too big?

UndeadBlow

@jsulm said in Handling very big HTML documents:

@UndeadBlow 40000 characters isn't much actually. Why do you think it will be too big?

It is just an example of how much the size grows. There will be maybe hundreds of such documents.

jsulm

@UndeadBlow Are you going to keep all these documents in memory at the same time, or are you going to handle them one after another?

UndeadBlow

@jsulm said in Handling very big HTML documents:

@UndeadBlow Are you going to keep all these documents in memory at the same time, or are you going to handle them one after another?

Most probably I will need to have them all in memory so I can search across all of them together. That's why I want to either clean or convert those HTML files.
But it seems that Tidy does not clean those files very well.

Eeli K

Is this a one-time customized solution for a specific problem, or a general solution that will be deployed on several different systems? You can calculate a worst-case upper limit for the data and compare it with the worst-case limit on available resources. Can the resources (memory) actually hold that data?

For example, 50 000 16-bit characters in one document, 1000 documents. It takes only about 100MB of memory. Really, which system can't handle that, unless you use embedded or mobile or a credit-card-sized system?
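The arithmetic behind that estimate can be written down as a small sketch. The function name `estimateBytes` and its parameters are made up for illustration; the 2 bytes per character reflect the 16-bit characters mentioned above (which is also what QString stores), and the document counts are the example figures from this post, not measurements.

```cpp
#include <cstdint>

// Rough upper bound on the memory needed to hold every document as 16-bit text.
// All three parameters are illustrative assumptions, not measurements.
constexpr std::uint64_t estimateBytes(std::uint64_t charsPerDocument,
                                      std::uint64_t documentCount,
                                      std::uint64_t bytesPerChar = 2) {
    return charsPerDocument * documentCount * bytesPerChar;
}

// 50 000 characters x 1000 documents x 2 bytes = 100 000 000 bytes (about 100 MB).
static_assert(estimateBytes(50000, 1000) == 100000000ULL,
              "matches the estimate in the text");
```

This counts only the character data; per-string overhead and any parsed structures would come on top of it.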

Another option is to use file handles, but I guess strings may be faster once you have loaded them. If you just have to run through all the files once, file handles may be better. If you have to do several searches, strings may be faster. Depending on your system and needs, you could also delegate the task to an existing search tool, for example grep. Such tools are probably already optimized for general cases and can handle large amounts of data.

p3c0

Since these are just plain files, how about opening them with QFile, parsing each line, and searching for the data? This way it won't be necessary to load all the data at once.

UndeadBlow

@Eeli-K

I worry more about time; memory is most probably not a big problem.
Well, search usually works in log N, so maybe I should not have worried. I just had a feeling that storing 100 MB in one string is not OK.
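A note on that estimate: a plain substring search over an unindexed string has to scan the text, so its cost grows roughly in proportion to the text size rather than logarithmically. Faster lookups need an index that is built once up front. The following is only a sketch of one such structure, a word-to-positions map; the names `Posting`, `WordIndex`, `buildIndex` and `lookup` are hypothetical, it handles whole ASCII words only, and it does not skip HTML tags.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One occurrence of a word: which document and the byte offset inside it.
struct Posting {
    std::size_t document;
    std::size_t offset;
};

using WordIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Build the index once; the cost is linear in the total text size.
// Words are runs of ASCII letters and digits, lowercased (a simplifying assumption).
// HTML tags are not skipped, so tag and attribute names are indexed too.
WordIndex buildIndex(const std::vector<std::string>& documents) {
    WordIndex index;
    for (std::size_t d = 0; d < documents.size(); ++d) {
        const std::string& text = documents[d];
        std::size_t i = 0;
        while (i < text.size()) {
            if (!std::isalnum(static_cast<unsigned char>(text[i]))) {
                ++i;
                continue;
            }
            const std::size_t start = i;
            std::string word;
            while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i]))) {
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(text[i])));
                ++i;
            }
            index[word].push_back({d, start});
        }
    }
    return index;
}

// Look up a whole word (already lowercase); on average this is one hash lookup.
const std::vector<Posting>& lookup(const WordIndex& index, const std::string& word) {
    static const std::vector<Posting> none;
    const auto it = index.find(word);
    return it == index.end() ? none : it->second;
}
```

The trade-off is extra memory for the index and a build step before the first search, in exchange for lookups that no longer depend on the total amount of text.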

mrjj

Hi
Can you give an example of what you search for?
Do you need any of the structure of the HTML when a word is found, or
is it a simple find of "love" where the HTML does not matter at all?
Also, when it is found, then what?

Do you need to know the place, or is the only important info "was the word found"?

UndeadBlow

@mrjj Yes, I need the HTML structure, and I need the position so that I can then copy the found fragment; otherwise it would be fine to just use the plain-text export from Google Docs. It seems that I can just use QString.
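A rough sketch of what "find the position, then copy the surrounding fragment" could look like on an in-memory string. It uses `std::string` as a stand-in for QString to stay self-contained; `findAll` and `contextAround` are hypothetical helper names, and the fixed character radius is an assumption (it cuts by character count, not by HTML element boundaries).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the offset of every occurrence of needle in text (overlapping ones included).
std::vector<std::size_t> findAll(const std::string& text, const std::string& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty()) return offsets;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + 1)) {
        offsets.push_back(pos);
    }
    return offsets;
}

// Copy the match plus up to `radius` characters of surrounding markup on each side.
// Returns an empty string if the offset lies outside the text.
std::string contextAround(const std::string& text, std::size_t offset,
                          std::size_t matchLength, std::size_t radius) {
    if (offset > text.size()) return {};
    const std::size_t begin = offset > radius ? offset - radius : 0;
    const std::size_t end = std::min(text.size(), offset + matchLength + radius);
    return text.substr(begin, end - begin);
}
```

Copying a well-formed fragment (for example, the whole enclosing element) would need an HTML parser rather than a character radius; this sketch only shows the offset bookkeeping.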

mrjj

@UndeadBlow
Ok, so the HTML does matter.
Yes, a 100MB string on a desktop-class machine is not really heavy, but
if you have many files then it will take some time going over them all :)