The GZ format generated by gzip does not support random access, at least not without the cost of a huge index file. The GZ generated by bgzip supports random access. The bgzip output is actually a hack that the GZ format was not originally intended for: the file is written as a series of small, independently compressed gzip blocks. The XZ format supports random access by design.
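To make the block idea concrete, here is a minimal sketch using only the Python standard library. It is not the real BGZF layout (bgzip also records each block's size in a gzip extra field and caps the block size); the function names `write_blocks` and `read_block` are made up for illustration. It only shows that concatenated gzip members remain a valid gzip stream, and that one member can be decompressed without touching the others.

```python
import gzip
import zlib


def write_blocks(chunks):
    """Compress each chunk as its own gzip member.

    Returns the concatenated compressed bytes and the byte offset at which
    each member starts. The offsets play the role of a (tiny) index.
    """
    out = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(out))  # where this member begins in the stream
        out += gzip.compress(chunk)
    return bytes(out), offsets


def read_block(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; any following members
    # are left untouched in decomp.unused_data.
    return decomp.decompress(data[offset:])


if __name__ == "__main__":
    records = [b"ID   A\n" * 100, b"ID   B\n" * 100, b"ID   C\n" * 100]
    blob, starts = write_blocks(records)
    # An ordinary gzip reader still sees one valid stream.
    print(gzip.decompress(blob) == b"".join(records))
    # Random access: jump straight to the second block.
    print(read_block(blob, starts[1]) == records[1])
```

A plain gzip file is a single member, so reaching a record near the end means decompressing everything before it, which is why random access there needs a large external index.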
I do not know how biopython is structured. I am not sure it is possible/easy to use bgzip with the EMBL parser. If that is possible, you can try to find an XZ Python binding and pair it with the EMBL parser. On the other hand, I am not sure how much space lzma may save you. Maybe 30%? In my opinion, for your 100GB file, a 30% saving is not worth the headache. Gzip is faster on decompression, much faster on compression and has wider support.
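Rather than guessing at the saving, it can be measured on a sample of the data. The sketch below uses Python's standard `gzip` and `lzma` modules; the helper name `compression_report` and the choice of level 6 for both are my own assumptions, and the numbers will depend heavily on the input.

```python
import gzip
import lzma


def compression_report(raw: bytes) -> dict:
    """Compress the same bytes with gzip and xz and report the sizes."""
    gz = gzip.compress(raw, compresslevel=6)
    xz = lzma.compress(raw, preset=6)
    return {
        "raw": len(raw),
        "gzip": len(gz),
        "xz": len(xz),
        # Fraction of the gzip size that xz saves (negative if xz is larger).
        "xz_saving_vs_gzip": 1 - len(xz) / len(gz),
    }


if __name__ == "__main__":
    # Stand-in data; in practice read the first few hundred MB of the EMBL file.
    sample = b"FT   CDS             1..300\nSQ   acgtacgtac ggttaaccgg\n" * 5000
    print(compression_report(sample))
```

For the parser side, `lzma.open(path, "rt")` returns an ordinary text handle, and parsers that accept a file handle should be able to read from it sequentially. That gives streaming access only, not random access.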
That said, if I were asked to achieve random access in a huge EMBL file, I would convert it to a GFF for features and a FASTA for sequences. I would then compress the GFF with bgzip and index it with tabix, and compress the FASTA with razip from samtools and index it with faidx. When I need to retrieve features, I use the tabix python binding or pysam, or call tabix, to query the GFF; when I need to retrieve sequences, I call samtools faidx (I do not know if pysam supports compressed faidx; if it does, you can use pysam instead of calling samtools). This approach is faster than working directly on an EMBL file and does not require you to write a lot of new tools. Nonetheless, if you would like to practice your programming skills, you can write a compressed EMBL indexer, which is going to take quite some time. I frequently waste my time on such projects, but in the long run, it pays off.
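For anyone curious why a faidx-style index makes sequence retrieval fast, here is a rough pure-Python sketch of the idea on an uncompressed FASTA. It is not samtools code; `build_index` and `fetch` are hypothetical names, and it assumes every line of a sequence except the last has the same length. The index stores, per sequence, its length, the byte offset of its first base, the bases per line and the bytes per line, so the position of any base is simple arithmetic followed by a seek.

```python
def build_index(path):
    """Return {name: (length, offset, line_bases, line_width)} for a FASTA."""
    index = {}
    name = None
    pos = 0  # byte offset of the current line
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The sequence starts right after the header line.
                index[name] = [0, pos + len(line), 0, 0]
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:  # the first sequence line defines the layout
                    entry[2] = bases
                    entry[3] = len(line)
                entry[0] += bases
            pos += len(line)
    return {k: tuple(v) for k, v in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) (0-based, end exclusive) of sequence `name`."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start < 0 or start >= end:
        return ""

    def byte_of(base):
        # Full lines before this base, plus the position within its line.
        return offset + (base // line_bases) * line_width + base % line_bases

    first, last = byte_of(start), byte_of(end - 1)
    with open(path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first + 1)
    # Drop the line breaks that fall inside the requested range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With a block-compressed file the same arithmetic gives an uncompressed offset, which then has to be mapped to a compressed block and a position inside it. The existing tools already handle that step, which is the main reason to reuse them.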
modified 6.3 years ago
6.3 years ago by
lh3 ♦ 31k