Process very large txt file
-
I have a machine that performs specific measurements and, as a result, produces a txt file containing a huge number of values (up to millions of elements). Each element has this form:
Image Name_Metric type_AOI Name_Descriptive Statistic\tImage name_Metric type_AOI name_Descriptive Statistic\t etc.
Each group of fields is separated by a tab character and is composed of:
- Image name
- Metric type
- AOI name
- Descriptive Statistic
In the first line of the file I have the groups separated by \t, and in the following lines I have the values (one value per group). The produced files can contain hundreds of thousands of groups or more (each group with its values). In my processing I have to read all the lines and group the groups by Image Name, Metric Type and AOI Name in this way:
Image Name 1
    Metric Type 1 (of image name 1)
        AOI Name 1
            Descriptive Stat 1: Value 1
            Descriptive Stat 2: Value 2
            Descriptive Stat x: Value x
        AOI Name 2
            Descriptive Stat 1: Value 1
            Descriptive Stat 2: Value 2
            Descriptive Stat x: Value x
        AOI Name x
            Descriptive Stat 1: Value 1
            Descriptive Stat 2: Value 2
            Descriptive Stat x: Value x
    Metric Type x (of image name 1)
        ...
Image Name x
    Metric Type 1 (of image name x)
        AOI Name x (of Metric Type x)
            Descriptive Stat 1: Value x
To contain all this data I thought of using a
std::unordered_map<QString, std::unordered_map<QString, std::vector<AOI>>> result;
To process the file and feed the data structure I use this code:
// Metric Data Structures
struct Metric {
    QString descriptiveStat;
    QString value;
};

struct AOI {
    QString aoiName;
    std::vector<Metric> metrics;
};

// Function to parse the file and extract metric data
std::unordered_map<QString, std::unordered_map<QString, std::vector<AOI>>> extractMetricData(const QString &filePath)
{
    std::unordered_map<QString, std::unordered_map<QString, std::vector<AOI>>> result;

    QFile file(filePath);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qWarning() << "Failed to open file:" << file.errorString();
        return result; // Return an empty map on failure
    }

    QTextStream in(&file);
    QString firstLine = in.readLine();
    QString secondLine = in.readLine();
    file.close(); // Close file after reading necessary lines

    // Input Validation
    if (!secondLine.startsWith("All Recordings\t")) {
        qWarning() << "Invalid second line format. Expected 'All Recordings' at the beginning.";
        return result;
    }

    QStringList metricDataGroups = firstLine.split('\t', Qt::SkipEmptyParts);
    QStringList valueStrings = secondLine.mid(15).split('\t', Qt::SkipEmptyParts);

    if (metricDataGroups.size() != valueStrings.size()) {
        qWarning() << "Number of metric groups and values don't match.";
        return result;
    }

    // Reserve space in vectors for potential performance improvement
    result.reserve(metricDataGroups.size() / 4); // Estimate number of unique images

    // Process metric data groups
    for (int i = 0; i < metricDataGroups.size(); ++i) {
        QStringList metricFields = metricDataGroups[i].split('_', Qt::SkipEmptyParts);

        // Input Validation (at least 4 fields are required)
        if (metricFields.size() < 4) {
            qWarning() << "Invalid metric data format (missing fields):" << metricDataGroups[i];
            continue; // Skip this entry
        }

        // Extract metric information
        QString imageName = metricFields[1];
        QString metricTypeName = metricFields[0];
        QString aoiName = metricFields[2];
        QString descriptiveStat = metricFields[3];
        QString value = valueStrings[i];

        Metric metric{descriptiveStat, value};

        // Create AOI with the single metric
        AOI aoi {aoiName, {metric}};

        // Create or update the nested structure directly
        result[imageName][metricTypeName].push_back(std::move(aoi)); // Push the AOI into the vector
    }

    return result; // Return the nested map structure
}

void MainWindow::processFile(const QString &filePath)
{
    std::unordered_map<QString, std::unordered_map<QString, std::vector<AOI>>> metricsByImageAndType = extractMetricData(filePath); // Get the map here

    if (metricsByImageAndType.empty()) {
        ui->textEdit->setText("No valid metric data found in the file.");
        return;
    }

    QString output;
    for (const auto &imageEntry : metricsByImageAndType) {
        // Image name (the key)
        const QString &imageName = imageEntry.first;
        output += "<br><b>Image: " + imageName + "</b><br>";

        for (const auto& typeEntry : imageEntry.second) { // Iterate over metric types in the image map
            // Metric type (the key)
            const QString& metricTypeName = typeEntry.first;
            // AOIs for this type
            const std::vector<AOI>& aois = typeEntry.second;
            output += "<br><i>Metric Type: " + metricTypeName + "</i><br>";

            // Iterate over AOIs of this type
            for (const AOI& aoi : aois) {
                for (const Metric &metric : aoi.metrics) {
                    output += " AOI Name: " + aoi.aoiName + "<br>";
                    output += " " + metric.descriptiveStat + ": " + metric.value + "<br>";
                }
            }
        }
    }

    ui->textEdit->setText(output);
}
I would like an opinion on the data structure I used, or to know whether there is a better option. Also, is reading a very large line with QTextStream::readLine() a good idea, or can it cause problems?
Lastly, to display the data I would like to use a QTreeView/QTreeWidget. To populate it, should I keep a copy of the already created data structure in memory, or can I delete it after filling the tree? I don't want to waste memory by keeping unnecessary data. The data may be used later to create some graphs.
-
Hi,
Do you have control over the output of that other application?
That said, I am wondering whether you should consider using a library such as polars to process your data before showing it.
As for your memory management question, if you build your model properly, you'll have all the data readily available.
-
Hi, unfortunately not. I don't have control over the application that produces the txt file; it's a binary executable.
So after building the model I can delete the unordered_map? Is a tree containing millions of elements fast to traverse for later operations? For example, I have to create 2D/3D plots from the data. As for polars, isn't it for Python applications? My application is written in C++/Qt.
-
Then DataFrame might be of interest (note that I haven't used it).
As for your model, since you want to build a tree view, you will need some order in which to access the data, so you should consider building it out of your base data directly.
Often, custom Qt models are built as wrappers on top of other data structure so you don't need multiple copies of the data.
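To make the point about ordered base data concrete, here is a minimal sketch in plain standard C++ (std::string instead of QString, so it stands alone). The names Stats, Aois, Metrics, Grouped, split, addField and groupLine are hypothetical, and it assumes the field order Image_Metric_AOI_Statistic described in the first post and that none of the four names contains an underscore. Unlike a vector holding one AOI entry per statistic, the nested ordered maps merge all statistics of the same AOI under one key and iterate in sorted order, which is the order a tree view would need.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical nested, ordered containers (sorted by key on iteration).
using Stats   = std::map<std::string, std::string>;  // statistic -> value
using Aois    = std::map<std::string, Stats>;        // AOI name -> statistics
using Metrics = std::map<std::string, Aois>;         // metric type -> AOIs
using Grouped = std::map<std::string, Metrics>;      // image name -> metric types

// Split on a single-character delimiter, keeping empty fields.
// Always returns at least one element.
std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Store one "Image_Metric_AOI_Statistic" header together with its value.
// Returns false (and stores nothing) unless the header has exactly 4 parts.
bool addField(Grouped &g, const std::string &header, const std::string &value) {
    const std::vector<std::string> f = split(header, '_');
    if (f.size() != 4) return false;
    g[f[0]][f[1]][f[2]][f[3]] = value;  // assumed order: image, metric, AOI, stat
    return true;
}

// Pair the header line with one line of values; malformed headers are skipped.
Grouped groupLine(const std::string &headerLine, const std::string &valueLine) {
    Grouped g;
    const std::vector<std::string> headers = split(headerLine, '\t');
    const std::vector<std::string> values  = split(valueLine, '\t');
    const std::size_t n = std::min(headers.size(), values.size());
    for (std::size_t i = 0; i < n; ++i)
        addField(g, headers[i], values[i]);
    return g;
}
```

Calling groupLine() with the first line of the file and one line of values gives a structure that can be walked image by image, then metric type, then AOI. Whether std::map or a hash map is the better trade-off for millions of entries would need measuring on real files.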
-
I am taking a look at the DataFrame you suggested; it looks promising. Does it support a tree-like data format?
A last question please: to read the file I am currently using QTextStream::readLine(). Does it support very large lines?
-
@franco-amato
QTextStream::readLine() does not mention any line length limit, and it accepts an optional qint64 maxlen = 0 parameter. Since qint64 is larger than however much memory you have, you can assume that will be the only barrier! It should cope with terabytes of line length... :)
-
@franco-amato said in Process very large txt file:
If DataFrame manages its own data structure, then creating a tree view will result in a copy of the data?
Although I have not looked at it, in principle it should not require a copy of data for a QTreeView. QTreeView requires (something derived from) a QAbstractItem/TableModel for its data model. But that does not require you to copy from your underlying data, it only requires you to provide an interface to it, supplying certain required methods. Which you can write to provide read/write access to your data directly, without copying, all being well. As @SGaist wrote:
Often, custom Qt models are built as wrappers on top of other data structure so you don't need multiple copies of the data.
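As an illustration of the wrapper idea, here is a minimal sketch in plain standard C++ rather than a real Qt model. The types Aoi and Image and the class ImageTreeAdapter are hypothetical; the adapter holds only a reference to data owned elsewhere and answers the kind of questions a view asks (how many rows under this parent, what text for this row). In an actual Qt program these would correspond to overrides such as QAbstractItemModel::rowCount() and data(), together with index() and parent(), which are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical base data, owned by the application and not by the adapter.
struct Aoi   { std::string name; std::string value; };
struct Image { std::string name; std::vector<Aoi> aois; };

// Read-only two-level tree interface over existing data.
// It stores a reference, so no element is copied; the referenced vector
// must outlive the adapter.
class ImageTreeAdapter {
public:
    explicit ImageTreeAdapter(const std::vector<Image> &images) : images_(images) {}

    // Rows under the root (parentRow < 0) or under the image at parentRow.
    std::size_t rowCount(int parentRow = -1) const {
        if (parentRow < 0) return images_.size();
        return image(parentRow).aois.size();
    }

    // Display text of a top-level row (parentRow < 0) or of a child row.
    std::string text(int row, int parentRow = -1) const {
        if (parentRow < 0) return image(row).name;
        const Image &img = image(parentRow);
        assert(row >= 0 && static_cast<std::size_t>(row) < img.aois.size());
        const Aoi &aoi = img.aois[static_cast<std::size_t>(row)];
        return aoi.name + ": " + aoi.value;
    }

private:
    // Bounds-checked access to one top-level entry.
    const Image &image(int row) const {
        assert(row >= 0 && static_cast<std::size_t>(row) < images_.size());
        return images_[static_cast<std::size_t>(row)];
    }

    const std::vector<Image> &images_;  // reference only, no copy
};
```

Because the adapter reads through the reference, a change to the underlying vector is visible on the next call to text(); a real Qt model would additionally have to emit the appropriate change signals so that attached views refresh.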