How to design a Qt file reader?
-
Hi,
I wonder how to design a file reader class (in a Qt fashion) that handles different formats. I may have several formats of huge text files containing DNA sequences (*.fasta, *.fastq). What do you suggest for designing a single file reader that reads all available formats? One class, SequenceFile, with the different loaders built directly into it:
    SequenceFile::fromFasta("file.fasta")
    SequenceFile::fromFastq("file.fastq")
Or a separate reader for each format (FastaReader, FastqReader, ...):
    Sequence * seq = new Sequence(AbstractReader * reader)
The main problem is that these files are huge and cannot be loaded into memory.
If you have other proposals, thank you. -
@dridk
Hello,
What kind of files are you working with, binary or text (I see that you mentioned text, but are all of them text)? Can you seek around them? Do they have some kind of meta-information about the format embedded? What kind of data are you supposed to read from them?
Kind regards.
-
@kshegunov
It's related to this:
https://forum.qt.io/topic/63463/model-with-large-data-from-file -
@kshegunov
I think a memory-mapped file with a sliding section would be perfect for his needs, but
I have never tried it with Qt and the map function.
Have you used that functionality? -
@mrjj , @dridk
No, I haven't used them, but any binary file will do. In my investigations of signal processing for large-scale scintillation detectors, I first converted the text file to binary. This is necessary to at least be able to seek in less than a lifetime. One could add a very simple "compression" as well: because there are only 4 bases in DNA, you can encode each base with only 2 bits! This means that a codon (a triplet of bases) takes 6 bits, less than a byte! If the files are not too large (2-4 GB), such a scheme will even allow you to map the whole data set into memory, which is the fastest by any standard. So, my advice is:
- Convert the text file to a (possibly temporary) binary file on open (it'll take some time, but should be manageable)
- Use an appropriate encoding scheme for the data
- (Possibly) Have a simple QAbstractItemModel referencing that data (keeping offsets into the data stream should be sufficient)
- Make a custom widget that draws the data
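As a rough illustration of the "appropriate encoding scheme" step, here is a minimal sketch of packing four bases into one byte. It is plain standard C++ rather than Qt so that it compiles on its own. The mapping A=00, C=01, G=10, T=11, the bit order (first base in the two most significant bits) and all function names are assumptions made for this example. Letters other than A, C, G and T are simply rejected, so real files with extra symbols would need a wider code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed mapping: A=00, C=01, G=10, T=11. Returns -1 for any other letter.
int encodeBase(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

// Inverse of encodeBase for the 2-bit codes 0..3.
char decodeBase(unsigned code)
{
    static const char letters[] = {'A', 'C', 'G', 'T'};
    return letters[code & 3u];
}

// With the mapping above, the complementary base is the bitwise inverse.
unsigned complementCode(unsigned code)
{
    return ~code & 3u;
}

// Packs four bases per byte, first base in the two most significant bits.
// Returns false as soon as an unsupported letter is found.
bool packBases(const std::string &bases, std::vector<std::uint8_t> &packed)
{
    packed.assign((bases.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < bases.size(); ++i) {
        const int code = encodeBase(bases[i]);
        if (code < 0)
            return false;
        const unsigned shift = 6 - 2 * static_cast<unsigned>(i % 4);
        packed[i / 4] |= static_cast<std::uint8_t>(code << shift);
    }
    return true;
}

// Reads base number `index` (zero-based) back out of the packed buffer.
char baseAt(const std::vector<std::uint8_t> &packed, std::size_t index)
{
    assert(index / 4 < packed.size());
    const unsigned shift = 6 - 2 * static_cast<unsigned>(index % 4);
    return decodeBase((packed[index / 4] >> shift) & 3u);
}
```

Calling `packBases("ACGT", packed)` fills `packed` with a single byte, and `baseAt(packed, 2)` returns `'G'`. Because every base has the same width, finding base `n` is pure arithmetic, with no scanning of the data.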
ADDENDUM
Back to your original question, which I actually forgot to answer, sorry:
Consider separating the presentation from the reading of the file. So you could have a class that reads and parses the data by accepting an open QFile/QTextStream instance. The same goes for the internal format you're using, if you decide to convert the file to binary. This:
    SequenceFile::fromFasta("file.fasta");
doesn't seem very promising.
I'd suggest something of the following kind:

    #include <QObject>
    #include <QTextStream>

    // FastASectionData is a placeholder for whatever one parsed section holds
    class FastAReader : public QObject
    {
        Q_OBJECT

    public:
        explicit FastAReader(QTextStream & ts)
            : QObject(), stream(ts)
        {
        }

        // Reads one section and announces it; returns false at the end of the stream
        bool nextSection()
        {
            if (stream.atEnd())
                return false;

            FastASectionData data;
            // ... Read a data section and fill up your `data` variable
            emit sectionReady(data);
            return true;
        }

        // Reads the whole file
        bool read()
        {
            while (nextSection())
                ;
            return stream.status() == QTextStream::Ok;
        }

    signals:
        void sectionReady(FastASectionData data);

    private:
        QTextStream & stream;
    };

To a class like this you could QObject::connect any processing object that does what you want with the data. It can be an object that writes the data to your binary file, one that fills up your internal data representation, or something of the sort. You use the reader simply by providing a valid QTextStream and then invoking the read() function.
Example usage:

    QFile file("myfile.fasta");
    if (!file.open(QFile::Text | QFile::ReadOnly)) {
        // Can't open the file, handle the error appropriately
    }

    QTextStream stream(&file);     //< Attach a stream to the file
    FastAReader reader(stream);    //< Initialize the file reader and/or parser
    FastADataProcessor processor;  //< This would hypothetically process the data

    QObject::connect(&reader, &FastAReader::sectionReady, &processor, &FastADataProcessor::processSection);

    if (!reader.read()) {
        // There was a problem reading the file, handle accordingly
    }

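To make the "read a data section" step a bit more concrete, here is a minimal sketch of the same idea in plain standard C++, so it can be compiled without Qt. `std::istream` stands in for QTextStream and a callback stands in for the sectionReady signal. `FastaSection` and `readFasta` are hypothetical names for this example. The only format knowledge assumed is that a FASTA record starts with a header line beginning with '>' and is followed by sequence lines.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical stand-in for the FastASectionData type used above.
struct FastaSection {
    std::string header;    // header line without the leading '>'
    std::string sequence;  // sequence lines joined, without line breaks
};

// Reads one record at a time and hands each finished record to `onSection`,
// so at most one record is held in memory. Returns the number of records.
std::size_t readFasta(std::istream &in,
                      const std::function<void(const FastaSection &)> &onSection)
{
    assert(onSection && "a callback is required");

    FastaSection current;
    bool haveSection = false;
    std::size_t count = 0;
    std::string line;

    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // tolerate Windows line endings
        if (line.empty())
            continue;

        if (line[0] == '>') {
            // A new header closes the previous record, if there was one
            if (haveSection) {
                onSection(current);
                ++count;
            }
            current.header = line.substr(1);
            current.sequence.clear();
            haveSection = true;
        } else if (haveSection) {
            current.sequence += line;
        }
    }

    // Deliver the last record, which has no following header to close it
    if (haveSection) {
        onSection(current);
        ++count;
    }
    return count;
}
```

You could try it with a `std::istringstream` holding `">seq1\nACGT\nTTAA\n"` and a lambda as the callback. Note that this still buffers one whole record, so for records that are themselves too large for memory the callback would have to be invoked per chunk instead.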
I know it's not a complete solution but I hope it'll help you for a start.
Kind regards.
-
Thanks all for your replies. Yes, they are text files, but I cannot compress them because they contain more than the letters ACGT.
I don't understand why reading a binary file is faster than reading a text file. In the end, isn't a text file also a binary file?
I will test all of your solutions and let you know. -
@dridk
Hello,
What I suggested is not compression per se, but a way to encode (meaning represent) base pair data more efficiently. As I noted, this is in no way a complete solution, but I think it should give you a starting point. Since adenine is complementary to thymine, the first could be encoded as the bit sequence 00 and the other as 11, while cytosine and guanine could be encoded as 01 and 10 respectively. This way you get the complementary base just by inverting the bits. Suppose you have encoded one strand of DNA; then you get the complementary strand simply by inverting all the bits. Since each base has a fixed size of 2 bits, you can use offsets to calculate exactly where a piece of data is located in a long base pair sequence. Suppose you have a sequence of codons (triplets of bases) and you know that some gene contains 3 codons and starts with the 35th codon of the base pair sequence; then you can access the gene sequence very easily. The gene starts at zero-based base pair offset (35 - 1) * 3 = 102 (bit offset 102 * 2 = 204) and its size is simply 9 base pairs, or 18 bits. I just hope my biology is not failing me with the calculations. So if you had the whole sequence stored in a binary file, to read the gene you seek to the correct position directly using those offsets:

    QFile mySequenceFile("dnasequence.dna");
    if (!mySequenceFile.open(QFile::ReadOnly)) {
        // You know the drill with handling errors
    }

    mySequenceFile.seek(25);                                 //< Go to byte 25 (bit 200)
    QByteArray geneSequenceData = mySequenceFile.read(3);    //< Read 3 bytes (up to bit 224)
    // So the byte array we've read contains the gene we're interested in:
    // it starts at bit 4 of the array and ends at bit 22.
    // The total number of bits read is 24.

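The offset arithmetic above can be written down as a small helper. This is only a sketch in plain standard C++ (no Qt), assuming 2 bits per base, the mapping A=00, C=01, G=10, T=11, and that bit 0 of a byte is its most significant bit. `BitRange`, `locateBases` and `unpackBases` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Where a run of bases lives inside a 2-bits-per-base binary file.
struct BitRange {
    std::size_t firstByte;  // byte offset to seek to
    std::size_t byteCount;  // number of bytes to read from there
    unsigned bitOffset;     // first data bit inside the bytes that were read
    std::size_t bitCount;   // number of data bits (2 per base)
};

// Computes the byte range covering `baseCount` bases starting at the
// zero-based base index `firstBase`.
BitRange locateBases(std::size_t firstBase, std::size_t baseCount)
{
    assert(baseCount > 0);
    const std::size_t firstBit = firstBase * 2;
    const std::size_t endBit = firstBit + baseCount * 2;  // one past the last bit

    BitRange range;
    range.firstByte = firstBit / 8;
    range.bitOffset = static_cast<unsigned>(firstBit % 8);
    range.byteCount = (endBit + 7) / 8 - range.firstByte;
    range.bitCount = baseCount * 2;
    return range;
}

// Decodes the bases from the bytes read at `range.firstByte`.
std::string unpackBases(const std::vector<std::uint8_t> &bytes, const BitRange &range)
{
    assert(bytes.size() >= range.byteCount);
    static const char letters[] = {'A', 'C', 'G', 'T'};

    std::string bases;
    const std::size_t endBit = range.bitOffset + range.bitCount;
    for (std::size_t bit = range.bitOffset; bit < endBit; bit += 2) {
        // Bit 0 is the most significant bit, so earlier bases sit higher up
        const unsigned shift = 6 - static_cast<unsigned>(bit % 8);
        bases += letters[(bytes[bit / 8] >> shift) & 3u];
    }
    return bases;
}
```

For the gene in the example, `locateBases(102, 9)` gives `firstByte == 25`, `byteCount == 3` and `bitOffset == 4`, which are the values passed to `seek()` and `read()` above.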
The whole point of having a structured binary file is to be able to seek around it without actually reading things. Obviously my example is pretty superficial, and it's much better to have a dedicated class that represents a base pair sequence, a class representing gene offsets, and classes for any other data you might want to handle. Additionally, you'd probably need some meta-information written in that file (offsets of sequences, genes or other things) so you can locate what you need. This is not possible with text files, especially in a platform-independent fashion. Moreover, a sequence of 4 base pairs takes 4 bytes in a text file, while with the proposed encoding scheme you only need a single byte!
Kind regards.