How to design a Qt file reader?

dridk · 30 Jan 2016, 14:07

Hi,
I wonder how to design a file reader class (in a Qt fashion) that handles several formats. I have huge text files containing DNA sequences in different formats (*.fasta, *.fastq). How would you suggest designing a single file reader that can read all the available formats?

One class, SequenceFile, with a loader for each format built in:

SequenceFile::fromFasta("file.fasta")
SequenceFile::fromFastq("file.fastq")

Or create a reader for each format (FastaReader, FastqReader, ...):

 Sequence *seq = new Sequence(reader); // where reader is an AbstractReader *

The main problem is that those files are huge and cannot be loaded into memory.
If you have other proposals, thank you.

kshegunov · 30 Jan 2016, 14:11

@dridk
Hello,
What kind of files are you working with, binary or text (I see that you mentioned text, but are they all text)? Can you seek around in them, and do they have some kind of meta-information about the format embedded? What kind of data are you supposed to read from them?

Kind regards.

mrjj · 30 Jan 2016, 14:12

@kshegunov
It's related to this:
https://forum.qt.io/topic/63463/model-with-large-data-from-file

kshegunov · 30 Jan 2016, 14:14

@mrjj
Oh, OK, I'll look that up. It was not referenced in the original post, so I had no idea. Thanks for the link.

mrjj · 30 Jan 2016, 14:23

@kshegunov
I think a memory-mapped file with a sliding window would be perfect for his needs, but
I have never tried it with Qt and its map function.
Have you used that functionality?

kshegunov · 30 Jan 2016, 14:14

@mrjj , @dridk
No, I haven't used them, but any binary file will do. In my work on signal processing for large-scale scintillation detectors I first converted the text file to binary. This is necessary to at least be able to seek in less than a lifetime. One could add a very simple "compression" as well: because there are only 4 bases in DNA, you can encode each base with just 2 bits! This means that a codon (three bases) takes 6 bits, which is less than a byte. If the files are not too large (2-4 GB), such a scheme will even allow you to map the whole data set into memory, which is the fastest option by any standard. So, my advice is:

1. Convert the text file to a (possibly temporary) binary file on open (it'll take some time, but should be manageable)
2. Use an appropriate encoding scheme for the data
3. (Possibly) Have a simple QAbstractItemModel referencing that data (keeping offsets into the data stream should be sufficient)
4. Make a custom widget that draws the data
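
As a rough illustration of the 2-bit encoding idea, here is a minimal sketch in plain C++ (no Qt needed). The names baseToCode, packBases and unpackBase are made up for this example, and both the code table (A=00, C=01, G=10, T=11) and the choice to store the first base in the lowest bits of each byte are assumptions. Input containing anything other than A, C, G or T is rejected, so files with other symbols would need a wider code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 2-bit codes (assumed table: A=00, C=01, G=10, T=11).
// Returns -1 for any other character.
inline int baseToCode(char base)
{
    switch (base) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

// Packs a sequence into bytes, four bases per byte, with the first base in
// the lowest two bits (an assumed bit order). Returns false and leaves
// `packed` empty if the input contains anything other than A, C, G or T.
inline bool packBases(const std::string &bases, std::vector<std::uint8_t> &packed)
{
    packed.assign((bases.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < bases.size(); ++i) {
        const int code = baseToCode(bases[i]);
        if (code < 0) {
            packed.clear();
            return false;
        }
        packed[i / 4] |= static_cast<std::uint8_t>(code << ((i % 4) * 2));
    }
    return true;
}

// Returns the base stored at the zero-based position `index`.
inline char unpackBase(const std::vector<std::uint8_t> &packed, std::size_t index)
{
    static const char letters[] = { 'A', 'C', 'G', 'T' };
    assert(index / 4 < packed.size());
    return letters[(packed[index / 4] >> ((index % 4) * 2)) & 0x3];
}
```

With this layout the four bases "ACGT" end up in one byte, and any base can be fetched by index without touching the rest of the data.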

ADDENDUM

Back to your original question, which I actually forgot to answer, sorry:
Consider separating the presentation from the reading of the file. You could have a class that reads and parses the data by accepting an open QFile/QTextStream instance. The same goes for the internal format you're using, if you decide to convert the file to binary. This:

SequenceFile::fromFasta("file.fasta");

doesn't seem very promising.
I'd suggest something of the following kind:

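#include <QObject>
#include <QTextStream>

// FastASectionData stands for whatever type holds one parsed section
class FastAReader : public QObject
{
    Q_OBJECT

public:
    explicit FastAReader(QTextStream & ts, QObject * parent = nullptr)
        : QObject(parent), stream(ts)
    {
    }

signals:
    void sectionReady(FastASectionData data);

public:
    // Reads one section and emits it; returns false at the end of the stream
    bool nextSection()
    {
         if (stream.atEnd())
             return false;

         FastASectionData data;
         // ... Read a data section and fill up your `data` variable

         emit sectionReady(data);
         return true;
    }

    bool read()
    {
        // Read the whole file
        while (nextSection())
            ;

        return stream.status() == QTextStream::Ok;
    }

private:
    QTextStream & stream; //< Not owned, must outlive the reader
};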

With a class like this you can QObject::connect any processing object that does what you want with the data. It can be an object that writes it to your binary file, one that fills up your internal data representation or something of the sort. You use it simply by providing a valid QTextStream and then invoking the read() function.
Example usage:

QFile file("myfile.fasta");
if (!file.open(QFile::Text | QFile::ReadOnly))
    ; //< Can't open the file, handle error appropriately

QTextStream stream(&file); //< Attach a stream to the file
FastAReader reader(stream); //< Initialize the file reader and/or parser

FastADataProcessor processor; //< This would hypothetically process the data
QObject::connect(&reader, &FastAReader::sectionReady, &processor,  &FastADataProcessor::processSection);

if (!reader.read())
    ; //< There was a problem reading the file, handle accordingly
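
To give an idea of what the "read a data section" step might look like, here is a minimal sketch in plain C++ that uses std::istream and a callback in place of QTextStream and a signal. FastaRecord and readFasta are hypothetical names, and the sketch assumes the simplest FASTA layout: a description line starting with '>' followed by one or more sequence lines.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream> // std::istringstream, handy for trying this on in-memory text
#include <string>

// Hypothetical record type; the FastASectionData above is left unspecified.
struct FastaRecord
{
    std::string header;   // text after '>' on the description line
    std::string sequence; // sequence lines joined, line breaks removed
};

// Reads records one at a time and hands each one to `onRecord`, so only a
// single record is held in memory at once. Returns the number of records
// delivered. Lines before the first '>' are ignored.
inline std::size_t readFasta(std::istream &in,
                             const std::function<void(const FastaRecord &)> &onRecord)
{
    FastaRecord record;
    bool haveRecord = false;
    std::size_t count = 0;
    std::string line;

    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();              // tolerate Windows line endings
        if (line.empty())
            continue;                     // skip blank lines
        if (line.front() == '>') {
            if (haveRecord) {             // a new header closes the previous record
                onRecord(record);
                ++count;
            }
            record.header = line.substr(1);
            record.sequence.clear();
            haveRecord = true;
        } else if (haveRecord) {
            record.sequence += line;
        }
    }
    if (haveRecord) {                     // deliver the last record
        onRecord(record);
        ++count;
    }
    return count;
}
```

A single very long record is still accumulated in memory here, so for huge sequences the callback would probably have to receive chunks of a record instead of whole records.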

I know it's not a complete solution, but I hope it'll help you get started.

Kind regards.

dridk · wrote on 30 Jan 2016, 21:17

Thanks, all, for your replies. Yes, it's a text file, but I cannot compress it that way because there are more letters than just ACGT.
I don't understand why reading a binary file is faster than reading a text file. In the end, isn't a text file also a binary file?
I will test all of your solutions and let you know.

kshegunov · wrote on 31 Jan 2016, 00:16

@dridk
Hello,
What I suggested is not compression per se, but a way to encode (meaning represent) base pair data more efficiently. As I noted, this is in no way a complete solution, but I think it should give you a starting point. Since adenine is complementary to thymine, the first could be encoded as the bit sequence 00 and the other as 11, while cytosine and guanine could be encoded as 01 and 10 respectively. This way you get the complementary base just by inverting the bits. Suppose you have encoded one strand of DNA; then you get the complementary strand simply by inverting all the bits.

Since each base has a fixed size of 2 bits, you can use offsets to calculate exactly where a piece of data is located in a long base pair sequence. Suppose you have a sequence of codons (triplets of bases) and you know that some gene contains 3 codons and starts with the 35th codon of the sequence; then you can access the gene very easily. Counting from zero, the gene starts at base pair (35 - 1) * 3 = 102 (that is, bit 102 * 2 = 204) and its size is simply 9 base pairs or 18 bits. I just hope my biology is not failing me with the calculations. So if you had the whole sequence stored in a binary file, to read the gene you seek to the correct position directly using those offsets:

QFile file("dnasequence.dna");
if (!file.open(QFile::ReadOnly))
    ; //< You know the drill with handling errors

file.seek(25); //< Go to byte 25, which starts at bit 200 (204 / 8 = 25, remainder 4)
QByteArray geneSequenceData = file.read(3);  //< Read 3 bytes (bits 200 up to, but not including, 224)

// The byte array we've read contains the gene we're interested in: it starts at bit 4 of the array and ends just before bit 22
// The total number of bits read is 24
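
The offset arithmetic above can be wrapped in a couple of helpers. This is only a sketch in plain C++: ByteRange, rangeForBases and decodeBases are hypothetical names, base indices count from zero, and the bit order inside a byte (first base in the lowest two bits) is an assumption, since nothing above fixes it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Byte range that has to be read to cover `baseCount` bases starting at the
// zero-based base index `firstBase`, at 2 bits per base.
struct ByteRange
{
    std::size_t byteOffset; // where to seek to
    std::size_t byteCount;  // how many bytes to read
    unsigned    bitShift;   // bits to skip inside the first byte read
};

inline ByteRange rangeForBases(std::size_t firstBase, std::size_t baseCount)
{
    const std::size_t firstBit = firstBase * 2;
    const unsigned shift = static_cast<unsigned>(firstBit % 8);
    return { firstBit / 8, (shift + baseCount * 2 + 7) / 8, shift };
}

// Decodes `baseCount` bases from `bytes`, skipping `bitShift` bits first.
// Assumes A=00, C=01, G=10, T=11, lowest bits first. With complement = true
// every 2-bit code is inverted, which swaps A with T and C with G.
inline std::string decodeBases(const std::vector<std::uint8_t> &bytes,
                               unsigned bitShift, std::size_t baseCount,
                               bool complement = false)
{
    static const char letters[] = { 'A', 'C', 'G', 'T' };
    std::string out;
    out.reserve(baseCount);
    for (std::size_t i = 0; i < baseCount; ++i) {
        const std::size_t bit = bitShift + i * 2;
        assert(bit / 8 < bytes.size());
        unsigned code = (bytes[bit / 8] >> (bit % 8)) & 0x3u;
        if (complement)
            code ^= 0x3u; // 00 <-> 11 and 01 <-> 10
        out.push_back(letters[code]);
    }
    return out;
}
```

For the gene in the example, rangeForBases(102, 9) gives byte offset 25, a length of 3 bytes and a shift of 4 bits, which matches the seek(25) and read(3) calls above.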

The whole point of having a structured binary file is to be able to seek around it without actually reading things. Obviously my example is pretty superficial, and it's much better to have a dedicated class that represents a base pair sequence, a class representing gene offsets and whatever other data you might want to handle. Additionally, you'd probably need some meta-information written in that file (offsets of sequences, genes or other things) so you can locate what you need. This is not practical with text files, where line lengths and line endings can vary, especially in a platform-independent fashion. Moreover, a sequence of 4 base pairs takes 4 bytes in a text file, while with the proposed encoding scheme it needs only a single byte!

Kind regards.