
Which type of database should I choose to store JSON data?



  • Hello All,

I have a number of JSON files, around 20,000 or 30,000 (and sometimes more), to process at once.
    I just need a count of how often the key XYZ is present in a particular file.
    I have found some possible solutions, listed below:

• Store the data in a variable like QHash<QString, QHash<QString, QList<s_DATA>>> HashData and extract from it for further processing. The problem here is the time it takes to extract the required data from the JSON and put it into HashData; the extraction is done recursively, digging into each file.
    • Store the data in a model like QSqlTableModel: create a database schema matching the JSON file structure, then extract the data by applying queries for further processing.
    • Store the data in a model like QStandardItemModel.

I have implemented the first solution and it normally takes about 1.5 minutes to process 3,000 JSON files, so I'm not sure whether it would be worthwhile to implement/try the other solutions to make it fast enough.
    Please give me suggestions about these solutions (or any other solution you may have) and how it would improve on the first one.

    Thanks


  • Lifetime Qt Champion

    Hi
If you MUST have the data in JSON in the first place, then the cost is parsing the files, so I don't see how putting the JSON
    files into any other DB or model can help with processing time at all.

I would try another parser and see,
    like this one:
    https://github.com/simdjson/simdjson

Also, you might be able to optimize your code to run faster, but it's hard to say from pure guessing.



  • @npatil15
    As @mrjj said, no point putting these in a database per se.

However, if processing time is critical, the key does not occur in a fair proportion of the files, and you are willing to use a tiny bit of "fuzzy" logic, you could probably save a lot of time by pre-searching your files for the XYZ key (or indeed for "XYZ":) and only passing on those files which contain it to be JSON-parsed. Remember, JSON is text, so this eliminates many files cheaply: text searching will be fast, parsing will be much slower.
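A minimal sketch of this text pre-filter in standard C++ (the key name XYZ is a placeholder; the quoted-name match is deliberately fuzzy and would miss unusual formatting such as a space before the colon):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Cheap text pre-filter: returns true if the raw file contents contain the
// key as a quoted JSON name, e.g. "XYZ": -- only such files then need the
// (much slower) full JSON parse.
bool mightContainKey(const std::string &path, const std::string &key)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return true;            // unreadable here: let the real parser decide
    std::ostringstream buf;
    buf << in.rdbuf();          // files are ~300 KB, fine to slurp whole
    return buf.str().find('"' + key + "\":") != std::string::npos;
}
```

Only the files for which this returns true then need the full JSON-parsing pass.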

If that improves things for you, you could then go further and store all the files in a database table, provided the DB allows them to be stored as text blobs which can be searched quickly, to avoid the overhead of having to open that many files.

Finally, alternatively/additionally, you could spread the reading/searching across a number of threads. I don't know how much difference it would make (you'd have to test), but there may be some latency in read I/O which could be saved (i.e. one thread may be able to search while another is "sleeping" on an I/O read).


  • Lifetime Qt Champion

    Hi
Btw, how big is each file?
    If they are not so big, you could also load them into memory and use threading, as @JonB suggests, to parse them in parallel. That could potentially cut the time by a factor of 4
    or more, depending on the hardware/CPU.



@mrjj , each file is normally below 300 KB on average.

Correct me if I'm wrong: as per @JonB's solution, I have to create n threads for n JSON files. So I'll try this.

And FYI, I have used this script (QJsonPath) to dig into the JSON file, extract the data, and take only a count from it.


  • Moderators

    @npatil15 said in Which type of database should I choose to store JSON data?:

    I have to create n threads for n number of JSON files.

    nonono!

You create n threads where n < the number of CPU cores, as a rule of thumb

    To make it more professional, use https://doc.qt.io/qt-5/qthread.html#idealThreadCount
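As a sketch of that rule of thumb in standard C++: a fixed-size worker pool sized from the hardware, where std::thread::hardware_concurrency() plays the role of QThread::idealThreadCount() and fileMatches is a stand-in for whatever per-file check is run:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Split a file list across a fixed pool of workers instead of one thread
// per file. Each worker takes every n-th file (striping), so no work queue
// or locking is needed; the shared counter is a simple atomic.
int countMatches(const std::vector<std::string> &files,
                 bool (*fileMatches)(const std::string &))
{
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> total{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < files.size(); i += n)
                if (fileMatches(files[i]))
                    ++total;
        });
    }
    for (auto &th : pool)
        th.join();
    return total;
}
```

In a Qt program the same shape falls out of QtConcurrent, which sizes its thread pool from idealThreadCount() automatically.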



At this page (threadandconcurrent), I have found several threading solutions; which one should I prefer?

I'll explain what I understood about the implementation:

• Create threads based on the number of CPU cores available.
    • Have one global structure where I'll store all the information extracted by the threads. I'll guard it with a mutex lock so concurrent access doesn't corrupt it. A semaphore would be unnecessary because the files have no dependency on each other.
    • Then I can use that global variable/structure for further processing. (This part is already working fine/fast.)

    Correct me if anything is wrong.
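The three steps above can be sketched like this in standard C++ (a Qt version would use QThread or QtConcurrent instead; processFile is a stand-in for the JSON-parsing/counting routine):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Step 1: a pool sized from the CPU count. Step 2: workers write per-file
// results into one shared structure, serialised by a mutex. Step 3: the
// caller uses the returned structure for further processing.
std::map<std::string, int>
gatherCounts(const std::vector<std::string> &files,
             int (*processFile)(const std::string &))
{
    std::map<std::string, int> results;   // the shared "global" structure
    std::mutex guard;
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < files.size(); i += n) {
                int count = processFile(files[i]); // parse outside the lock
                std::lock_guard<std::mutex> lock(guard);
                results[files[i]] = count;         // short critical section
            }
        });
    }
    for (auto &th : pool)
        th.join();
    return results;
}
```

Keeping the parse outside the critical section matters: the lock is held only for the map insert, so the threads spend almost all their time working in parallel.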


  • Moderators

    @npatil15 sounds about right, as one solution

    but instead of

    Will have one global structure where I'll store all information extracted from threads. I'll implement a mutex lock to not affect by deadlock. A semaphore will be useless because it has no dependency on other files.

you could add a copy constructor to a smaller struct and use signals & slots to transfer it across threads.

Depending on the complexity of your struct, it may however be just as complex to write as your mutex solution :)
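A rough standard-C++ analogue of that queued signal/slot idea: a small copyable result struct handed between threads through a thread-safe queue. (In Qt you would instead register the struct with qRegisterMetaType so it can travel through a queued connection; FileResult and ResultQueue are made-up names for illustration.)

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// A small, cheap-to-copy result struct -- the Qt version would additionally
// be registered with qRegisterMetaType<FileResult>() for queued connections.
struct FileResult {
    std::string file;
    int count = 0;
};

// Workers push copies of their results; one consumer pops them, so no code
// outside this class ever touches shared state directly.
class ResultQueue {
public:
    void push(FileResult r) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(r));
        }
        ready_.notify_one();
    }
    FileResult pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        FileResult r = queue_.front();
        queue_.pop();
        return r;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<FileResult> queue_;
};
```

The trade-off is as described above: the copyable struct plus queue is cleaner to reason about, but not necessarily less code than locking one shared structure.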



  • @npatil15

• You do not want to try to create 25,000 threads for that number of files! :) As @mrjj & you have discovered, use a small number.

• [I see @J-Hilk has replied while I am typing this.] You could use mutexes, or you may find that sending a (queued, cross-thread) signal is faster; you'd have to test.

• My guess, based on no experience and holding a finger to the wind, is that this approach will approximately halve your processing time. (Let us know!) However, if you can afford to do what I said about text-searching the files for the required string and only JSON-parsing the resulting hit files, and if that eliminates a large number of contenders (say >50%), it will make a big difference, as the JSON parsing will be really slow compared to reading/searching...



@JonB Yeah, I've got the scenario for the implementation.
    I'll try to implement it and get back with how it works compared to the previous implementation.

