Best strategy to process large Qsqlquery

Burrito

Hey,

I am trying to develop a tool to process raw data which I receive from a MS SQL database and store the processed (sorted) data, to avoid repeated query of the same data as the results can contain up to a few million rows.

The raw data received consists of the fields [Date] as ticks (BIG INT), [Data type] as INT and [Data value] as REAL. Thereby, it is possible to receive multiple rows of [Data type] and [Data value] for the same [Date].

What i basically want is to receive the data, sort it into a vector of the struct for further processing:

struct raw
{
   long long int t_datetime;
   int t_type;
   double t_value;
};

I wrote the following code to query the rawdata:

QSqlDatabase serverdb;                  

 //Connect database server via serverid;
connectServerByID(serverdb, serverid); // Looks up existing connection or creates new one and performs db.open();

//Create rawresult vector for 
std::vector<raw> rawresults;

// Create qry3 string from query-generator getConnectionStringRaw
QString qry3 = getConnectionStringRaw(databases,starttime,endtime);
qDebug() << "Generated query: " << qry3 << endl;
QSqlQuery query(serverdb);
qDebug() << "starting query" << endl;

query.exec(qry3);
   while(query.next())
   {
           QSqlRecord rec = query.record();

           for(int i=0; i<rec.count(); ++i)
           {
                  raw datapoint;

                  datapoint.t_datetime = query.value(0).toLongLong();
                  datapoint.t_type = query.value(1).toInt();
                  datapoint.t_value = query.value(2).toDouble();
                  rawresults.push_back(datapoint);
           }
   }
   qDebug() << "result count: " << rawresults.size() << endl;

In general the code works. The complete loop over a query with 2.4 mio result rows without sorting takes around 40 secs, which is a perfectly acceptable time for me.
Unfortunately, the moment i try to add the received data into the vector via push_back or sort them via a switch statement directly into the final structure, the process takes around 45 minutes.

It seems the way I try to process the date is very inefficient, so I would be glad to receive advice for a more elegant solution.

Thanks you :)

jsulm

@Burrito said in Best strategy to process large Qsqlquery:

The complete loop over a query with 2.4 mio result rows without sorting

If you really have to fetch such amount of data and put it into a vector (not sure what the use case is) you should do it in a thread.
And sorting is better done on SQL server side (ORDER BY).

Pablo J. Rogina

@jsulm said in Best strategy to process large Qsqlquery:

And sorting is better done on SQL server side (ORDER BY).

Yes, and you can also "avoid repeated query of the same data" by using SQL DISTINCT statement.

Burrito

@jsulm I already sort the the data using ORDER BY Date ASC;

What I what as the end result is to put all received data into a local sqlite table of form:

Date | ValueOfDataType1|ValueOfDataType2|ValueOfDataType3|ValueOfDataType4|....|

as I want to perform different calculations and plottings of the columns later on.

If I understand you right you suggest to split the query into x seperate queries, where I define e.g.
WHERE DataType=1 and later on combine all of the queries?

jsulm

@Burrito said in Best strategy to process large Qsqlquery:

you suggest to split the query into x seperate queries

No
I only suggested to do this long lasting stuff in a thread to not to block your main thread.

JonB

@Burrito
If you are saying you want to eliminate repeated occurences of rows with same date you want to do that at SQL Server side. Hopefully with @Pablo-J-Rogina 's SELECT DISTINCT. You really do not want to be sorting that many rows in your client Qt code, as you have discovered.

But I'm not sure that is what you're saying.

Date | ValueOfDataType1|ValueOfDataType2|ValueOfDataType3|ValueOfDataType4|....|

Explain what you mean here. Are all these values being stored as (a fixed number of) columns in your SQLite? Or is it a string?? Or what?

It looks like you want to do some coalescing/processing of your data rows? If it were me, and I cared about speed, I would do all of that in a SQL query up at the SQL server, wildly more efficient both in speed and output volume. I don't know how much work you want to put into the SQL Server side code vs in your Qt client.

Burrito

@JonB Maybe I didn´t make that clear enough!

The data I want to process is coming from a scientific instrument, so I can obtain multiple parameters per date point, but only data_value type 1 is always present.

The tables typically look like this:

Date |  DataType |  DataValue
----------------------------
15535268742988000| 1 | 0.136423
15535268742988000| 2 | 12.151334
15535268742988000| 3 | 0.423564
15535268742988000| 4 | 0.001244
15535268743538000| 1 | 0.215352
15535268744538000| 1 | 1.134652
15535268744538000| 2 | 3.123412
15535268842988000| 1 | 0.113451
15535268842988000| 2 | 18.151334
15535268842988000| 3 | 1.456564
15535268842988000| 4 | 1.001244

So I want to create a row for each Date value (as one measurement point) and sort all available values into specific columns according to their DataType.

The result for the example should look like this:

Date | Data1 | Data2 | Data3 | Data4
-----------------------------------
15535268742988000 | 0.136423 | 12.151334 | 0.423564 | 0.001244
15535268743538000 | 0.215352 |                      |             
15535268744538000 | 1.134652 | 3.123412  |          |
15535268842988000 | 0.113451 | 18.151334 | 1.456564 | 1.001244

jsulm

@Burrito Why do you want to do this actually? Wouldn't this mean a variable number of columns?

JonB

@Burrito
Exactly as @jsulm just asked you? That is why you'll see I wrote

Are all these values being stored as (a fixed number of) columns in your SQLite?

It's not going to work if you don't know how many columns you want. Get this decided before we proceed!

I'm getting confused as to whether you're saying there are only every 4 columns?

And are you saying the combination Date + DataType is unique, i.e. only ever one (or zero) rows for that?

Burrito

@JonB said in Best strategy to process large Qsqlquery:

. Are all these values being stored as (a fixed number of) columns in your SQLite? Or is it a string?? Or what?

@JonB & @jsulm Sorry, I didn´t see that :)
The number of columns is fixed to 9, there are only 8 different DataTypes plus the Date value.

I need all that data, because I want to create different plots from that e.g. Data1 vs. Data2 or Date vs. Date3, but they need to be processed (e.g. division by a chosen factor) for that, so I can´t use the raw data as received from the SQL server directly.

JonB

@Burrito
Your post has crossed with mine. I need to know:

And are you saying the combination Date + DataType is unique, i.e. only ever one (or zero) rows for that?

I also need to know if you are prepared to do some smart SQL Server code, or whether you only want to write stuff in your Qt client?

Burrito

Sorry for the long response time, I can only answer every 10 min.

@JonB: Yes, the combination of Date + DateType is always unique.

I´m not very experienced with SQL apart from simple queries, but of course I´m interested if SQL Server Code is the optimum solution for such a process!

If you could give me a few key words or an example I would be very grateful to dive into the matter!

JonB

@Burrito
OK, at least I understand now :)

Then I believe/hope you are in luck! T-SQL has a PIVOT operator. If I understand right, it does exactly what you want, all up at the server, efficiently!

You have a rowset. It has a unique key of columns Date + DataType. What you want is to turn those rows into columns, where the columns are dictated by the value of DataType, with separate columns for each value in DataType.

https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15

You can use the PIVOT and UNPIVOT relational operators to change a table-valued expression into another table. PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output. And PIVOT runs aggregations where they're required on any remaining column values that are wanted in the final output. UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

https://www.sqlservertutorial.net/sql-server-basics/sql-server-pivot/ claims to explain it simply, but I'm not sure I find it much clearer... :)

I hope this corresponds to what you are trying to do. I have used it before. I'm trying to read up on it now... :)

Burrito

@JonB Thank you very much, the two links are exactly what I was searching for. I will try around with it now =)

JonB

@Burrito
I'm working off the Basic PIVOT Example on https://docs.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-ver15. I don't have MS SQL to hand, no testing, so this is brainstorm. You'll have to play with it. (Obviously, what you want to do first is practice the query in MS SQL Server Management Studio, assuming you have access.)

I'm thinking something like:

SELECT [Date],   
[Data1], [Data2], [Data3], [Data4]  
FROM  
(SELECT [Date], [DataType], [DataValue]
    FROM [table]) AS SourceTable  
PIVOT  
(  
ANY([Date]) , ANY([DataValue]) 
FOR [DataType] IN ([1], [2], [3], [4])  
) AS PivotTable
ORDER BY PivotTable.[Date];

The bit I'm not sure about is where I've written

ANY([Date]) , ANY([DataValue])

That might be right. It might be without the ANY() around around the column names.

The point is that example/PIVOT usually aggregates values where there are multiple ones, e.g. AVG() in the example. In our case we know the values are unique for Date + DataValue, but it does not require that. I am thinking ANY() will keep it happy, but I'm not sure.

There is doubtless an example for this case somewhere out there, e.g. on stackoverflow, if you get stuck.

Obviously, please let me know, I'm interested! If & when you get it working it's going to go like a rocket, hopefully more like your your original 40 seconds than 40 minutes :)

Otherwise, thinking aloud, if we fail miserably to get this PIVOT working we can always still write your query with a CASE and a JOIN and a GROUP BY, which must be what the PIVOT is doing....

kshegunov

@Burrito said in Best strategy to process large Qsqlquery:

So I want to create a row for each Date value (as one measurement point) and sort all available values into specific columns according to their DataType

I have a question. Why do you want to do this to begin with? Put an index spanning Date and DataType (i.e primary key) and an index on DataType and just operate directly would be my go to. I'm missing the point of this row aggregation, I guess, could you explain?

Burrito

@kshegunov The raw data tables are automatically created by the measurement software (which I can´t modify) and there is no unique key or primary key in that tables. Therefore I need to perform something like "Select * from table where devicechannel = X and DateTime between x and y" to receive the wanted data. I need to specify the time range I want to select as Devicechannel is not a unique value. I´m not sure if I understand you right, but as I understand it what you suggested is not possible with this table?

@JonB Thank you for the code to start with! I had to slightly modify it as the SSMS gave me syntax errors.

This one works fine in SSMS:

SELECT * 
FROM  
(SELECT [DateTime], [DataType], [DataValue]
    FROM Database.Table) AS SourceTable  
PIVOT  
(  
	MAX(DataValue)
	FOR DataType 
	IN ([1], [2], [3], [4])  
) AS PivotTable
ORDER BY PivotTable.[DateTime];

For anyone interested and looking for an explaination:
What I had to do is use MAX (or MIN, doesnt matter) as aggregate function. As you pointed out, every combination of DateTime and DataType is unique -> there is always only one MAX value of DataValue per DateTime and DataType.

The DateTime argument is not needed in the PIVOT statement, it just needs to be selected in the select statement of the initial query and is automatically assigned to a column of the PIVOT table.

What I still wonder about is the performance difference between the query in SSMS and my QT application:
SSMS: 28 secs
Query with QT (without iteration over all record elements): 257 secs
Query with QT and pushing all data into a vector: 260 secs

It seems the vector.push_back is not the problem as it does not significantly increase the required time, however, there is a factor of ~10 between SSMS and the QT query.

Anyone an idea how to overcome this?

JonB

@Burrito
Thanks for posting, and adjusting as required from mine. Glad you have this working!

Timings. Let's be clear, you are running both your Qt program & SSMS from the same machine? And the SQL Server is local to that machine, or remote? Just how many rows does your pivoted query return, approximately? And when you say "Query with QT (without iteration over all record elements)" what exactly are you doing in your Qt code to execute the query? Is it exactly as shown originally, with what removed, or do you not even do the query.next(), or... ?

Burrito

@JonB

The QT program and SSMS are running on the same machine, the SQLEXPRESS server app is running on a linux machine on the network. The query and the pivoted result is approx 240k instead of 2.5 mio. inital rows.

I just realised that my "query without interation" still iterated over all QSqlRecord records - just didn´t do anything with it. Running just the empty while(query.next() loop148 secs. So I already "halfed" my query time by removing QSqlRecord, however, using the following code without QSqlRecord:

while(query.next())
{
     rawResult singleresult;

     singleresult.Date= query.value(0).toLongLong();
     singleresult.Data1= query.value(1).toDouble();
     singleresult.Data2 = query.value(2).toDouble();
     singleresult.Data3 = query.value(3).toDouble();
     singleresult.Data4 = query.value(4).toDouble();
     singleresult.Data5= query.value(5).toDouble();
     singleresult.Data6 = query.value(6).toDouble();
     singleresult.Data7 = query.value(7).toDouble();
     singleresult.Data8 = query.value(8).toDouble();

     rawresults.push_back(singleresult);
}

The time is 270 secs again. Just executing the query.exex(qry3) needs 23 secs just like SSMS.

JonB

@Burrito
Hmm. Where to guess....

The driver used by SSMS is, I think, some kind of "SQL native query" driver, I think the one used by Qt may be different. You specify QODBC https://doc.qt.io/qt-5/sql-driver.html#qodbc to QSqlDatabase::addDatabase, right?

https://doc.qt.io/qt-5/sql-driver.html#qodbc-for-open-database-connectivity-odbc

Note: You should use the native driver, if it is available, instead of the ODBC driver. ODBC support can be used as a fallback for compliant databases if no native driver is available.

Unlike for, say, MySQL you have no choice from Qt for MS SQL. There may be an overhead there.

I need you test/confirm what happens when you do just:

while(query.next())
{
}

Is that, with no query.value() calls, your "needs 23 secs just like SSMS`.

In which case, pulling the values out of the rows is costly. You are doing it via a non-native ODBC driver, SSMS is doing it via a native SQL client one. There are are 240k * 9 extract/converts to do. Comment out half the query.value() lines, does the time halve?

You may as well also set https://doc.qt.io/qt-5/qsqlquery.html#setForwardOnly. From what you say that does not seem to be the issue, but put it in anyway.