@dridk
Hello,
What I suggested is not compression per se, but a way to encode (that is, represent) base pair data more efficiently. As I noted, this is in no way a complete solution, but I think it should give you a starting point. Since adenine is complementary to thymine, the first could be encoded as the bit sequence 00 and the other as 11, while cytosine and guanine could be encoded as 01 and 10 respectively. This way you get the complementary base just by inverting the bits. Suppose you have encoded one strand of DNA; you then get the complementary strand simply by inverting all the bits.

Since each base has a fixed size of 2 bits, you can use offsets to calculate exactly where a given piece of data is located in a long base pair sequence. Suppose you have a sequence of codons (base triplets) and you know that some gene contains 3 codons and starts with the 35th codon of the sequence; then you can access the gene very easily. The gene starts at base pair offset (35 - 1) * 3 = 102 (or bit offset 102 * 2 = 204), and its size is simply 9 base pairs, or 18 bits. I just hope my biology is not failing me with the calculations. So if you had the whole sequence mapped in a binary file, to read the gene you would seek to the correct position directly using those offsets:

#include <QByteArray>
#include <QFile>

QFile file("dnasequence.dna");
if (!file.open(QFile::ReadOnly))
    return; //< You know the drill with handling errors
file.seek(25); //< Go to byte offset 25 (bit offset 200)
QByteArray geneSequenceData = file.read(3); //< Read 3 bytes (bits 200 up to, but not including, 224)
// The byte array we've read contains the gene we're interested in:
// it starts at bit offset 4 of the array and ends just before bit offset 22 (18 bits).
// The total number of bits read is 24.
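To make the bit arithmetic concrete, here is a minimal sketch of how the bases could be pulled out of the bytes read above. It is plain standard C++ rather than Qt, and the helper names (`encodeBase`, `decodeBase`, `complementBase`, `unpackBases`) are made up for illustration. It also assumes that the first base of every byte sits in the two most significant bits, which is just one possible convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// 2-bit codes as proposed in the post: A=00, C=01, G=10, T=11.
std::uint8_t encodeBase(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  throw std::invalid_argument("unknown base");
    }
}

char decodeBase(std::uint8_t code) {
    return "ACGT"[code & 3];
}

// Complementary base: invert both bits (A<->T, C<->G).
std::uint8_t complementBase(std::uint8_t code) {
    return static_cast<std::uint8_t>(~code & 3);
}

// Decode `count` bases starting at bit offset `firstBit` of `bytes`.
// Assumption: within a byte, the first base occupies the two highest bits.
std::string unpackBases(const std::vector<std::uint8_t>& bytes,
                        std::size_t firstBit, std::size_t count) {
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t bit = firstBit + 2 * i;
        assert(bit / 8 < bytes.size());
        const unsigned shift = static_cast<unsigned>(6 - bit % 8);
        out += decodeBase(static_cast<std::uint8_t>(bytes[bit / 8] >> shift));
    }
    return out;
}
```

With the 3 bytes from the example, the gene would then be `unpackBases(bytes, 4, 9)`: skip the first 4 bits and decode 9 bases (18 bits).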

The whole point of having a structured binary file is to be able to seek around it without actually reading everything. Obviously my example is pretty superficial, and it's much better to have a dedicated class that represents a base pair sequence, a class representing gene offsets, and classes for whatever other data you might want to handle. Additionally, you'd probably need some meta-information written in that file (offsets of sequences, genes or other things) so you can locate what you need. This is not possible with text files, especially in a platform-independent fashion. Moreover, in a text file a sequence of 4 base pairs takes 4 bytes, while with the proposed encoding scheme you need only a single byte!
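As a rough idea of what such a sequence class might look like, here is a sketch in plain standard C++ (no Qt). The class name `PackedSequence` and its interface are purely hypothetical, and it again assumes the first base of each byte is stored in the two most significant bits. File I/O and the meta-information header are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper class: stores a base sequence at 2 bits per base.
class PackedSequence {
public:
    // Packs a text sequence (characters A, C, G, T), four bases per byte.
    explicit PackedSequence(const std::string& text)
        : m_length(text.size()), m_bytes((text.size() + 3) / 4, 0) {
        for (std::size_t i = 0; i < text.size(); ++i)
            m_bytes[i / 4] |= static_cast<std::uint8_t>(code(text[i]) << shift(i));
    }

    std::size_t length() const { return m_length; }          // number of bases
    std::size_t byteCount() const { return m_bytes.size(); } // storage used

    // Direct access to one base through its offset.
    char at(std::size_t index) const {
        assert(index < m_length);
        return "ACGT"[(m_bytes[index / 4] >> shift(index)) & 3];
    }

    // Complementary sequence: invert every bit. Padding bits in the last
    // byte get inverted too, but length() keeps them from being read.
    PackedSequence complement() const {
        PackedSequence result(*this);
        for (std::uint8_t& byte : result.m_bytes)
            byte = static_cast<std::uint8_t>(~byte);
        return result;
    }

private:
    static unsigned shift(std::size_t index) {
        return static_cast<unsigned>(6 - 2 * (index % 4));
    }
    static std::uint8_t code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  throw std::invalid_argument("unknown base");
        }
    }

    std::size_t m_length;
    std::vector<std::uint8_t> m_bytes;
};
```

For instance, `PackedSequence("ACGT")` occupies a single byte, and calling `complement()` on it gives a sequence that reads back as T, G, C, A.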

Kind regards.