11.1 years ago
daattali
Hi,
I'm new to genome compression. I was reading through this recent paper, and some of their results left me with an unanswered question (paper available here).
In Table 5, they show that simply stripping away all non-essential fields of a VCF file and then compressing it with 7z achieves excellent compression compared with other genome compression algorithms (1.7 MB for a human genome). It made me wonder: if it's so simple, why hasn't this been used?
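To make the question concrete, here is a rough sketch of the "strip, then compress" idea. It is not the paper's actual procedure: which fields count as essential is my assumption (CHROM, POS, REF, ALT), the `strip_vcf` helper is hypothetical, and Python's `lzma` module stands in for 7z (it belongs to the same LZMA family, but writes an xz container rather than a .7z archive).

```python
import lzma

def strip_vcf(text):
    """Drop header lines and keep only CHROM, POS, REF, ALT (an assumed 'essential' set)."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        out.append("\t".join((chrom, pos, ref, alt)))
    return "\n".join(out) + "\n"

# Tiny made-up VCF, for illustration only.
vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=10\n"
    "1\t250\trs2\tC\tT\t60\tPASS\tDP=12\n"
)

stripped = strip_vcf(vcf)
packed = lzma.compress(stripped.encode())  # one solid stream: no random access

# Getting any record back means decompressing the whole stream.
restored = lzma.decompress(packed).decode()
```

Note that the result is a single compressed stream, so reading any one variant means decompressing everything before it.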
...particularly the second point (though pragmatically life is much easier for computational biologists using linux :)
...the point being not bgzip and tabix specifically, but that random access is important so that, together with some kind of index, you can efficiently answer questions about a subset of your data -- e.g., show me the variants in some particular region of the genome.
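A toy sketch of that idea, for illustration only: records sorted by position are compressed in small independent blocks, and a tiny index remembers which positions each block covers. This is not the real bgzip/tabix format; the function names, the block size, and the single-chromosome simplification are all assumptions made for the example.

```python
import zlib

def build_blocks(records, block_size=3):
    """Compress sorted (pos, info) records in independent blocks; return (blob, index)."""
    blob = bytearray()
    index = []  # one (first_pos, last_pos, offset, length) entry per block
    for i in range(0, len(records), block_size):
        chunk = records[i:i + block_size]
        payload = "\n".join(f"{pos}\t{info}" for pos, info in chunk).encode()
        comp = zlib.compress(payload)
        index.append((chunk[0][0], chunk[-1][0], len(blob), len(comp)))
        blob += comp
    return bytes(blob), index

def query(blob, index, start, end):
    """Return records with start <= pos <= end, decompressing only overlapping blocks."""
    hits = []
    blocks_read = 0
    for first, last, offset, length in index:
        if last < start or first > end:
            continue  # block cannot contain the region, so skip it entirely
        blocks_read += 1
        text = zlib.decompress(blob[offset:offset + length]).decode()
        for line in text.split("\n"):
            pos_str, info = line.split("\t")
            pos = int(pos_str)
            if start <= pos <= end:
                hits.append((pos, info))
    return hits, blocks_read

# Made-up variants on a single chromosome, sorted by position.
records = [(100, "A>G"), (250, "C>T"), (900, "G>A"), (1500, "T>C"),
           (1600, "A>T"), (4000, "G>C"), (5200, "C>A")]
blob, index = build_blocks(records)
hits, blocks_read = query(blob, index, 1400, 1700)
print(hits, blocks_read)
```

Here the region query touches one of the three blocks. The trade-off is that compressing blocks independently usually gives a worse ratio than one solid archive, which is the price paid for being able to jump straight to a region.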
That makes sense, I didn't know that random access is such a high priority. Thanks.