Question

Parsing And Indexing Genomes In Embl Format Compressed With Gz/Xz

12.5 years ago

Pappu ★ 2.1k

I want to parse and index genomes in EMBL format, compressed with gz/xz, which I downloaded from EBI. The problem is that I can only list features by working with the uncompressed files, and the uncompressed files are too big.

As far as I understand, if I compress them using bgzip from tabix, I can use them in Biopython [1]. Can I directly index an xz file compressed with LZMA2, which gives a much smaller file? It should be possible in principle [2]. I am wondering if anyone has done it.

python • 2.9k views

updated 12.5 years ago by lh3 33k • written 12.5 years ago by Pappu ★ 2.1k

How much data are we talking about, zipped and unzipped? It may be wiser to get more disk space. Random access to zipped data is complex, and I am a bit sceptical. Use UCSC 2bit format and indexing?

12.5 years ago by Martin A Hansen 3.0k

Thank you for the info. It's around 110 GB with gz compression. Indexing will allow fast access to various parts of the file.

12.5 years ago by Pappu ★ 2.1k

score 6 · Answer 1 · 2013-01-09

The GZ format generated by gzip does not support random access, at least not without the cost of a huge index file. The GZ generated by bgzip supports random access. The bgzip output is actually a hack that was not originally intended by the GZ format. The XZ format supports random access by design.
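To make the bgzip idea concrete, here is a minimal sketch using only the Python standard library. It is not the real BGZF format (bgzip records block sizes inside the gzip headers, which this sketch does not do); it only illustrates the underlying trick of writing many small, independent gzip members and remembering where each one starts. The names `compress_blocked` and `read_range`, and the in-memory index, are hypothetical choices for illustration.

```python
import gzip
import io

# Uncompressed bytes per block. 64 KiB is an assumption for this sketch;
# BGZF also keeps its blocks small so that one block is cheap to decompress.
BLOCK_SIZE = 64 * 1024


def compress_blocked(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data as concatenated gzip members.

    Returns (blob, index), where index[i] is the byte offset in blob
    at which the compressed block i starts.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        index.append(out.tell())
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), index


def read_range(blob: bytes, index, start: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[start:start + length], decompressing only the blocks touched."""
    if length <= 0:
        return b""
    first = start // block_size
    last = (start + length - 1) // block_size
    chunks = []
    for i in range(first, min(last, len(index) - 1) + 1):
        # A block ends where the next one starts (or at the end of the blob).
        end = index[i + 1] if i + 1 < len(index) else len(blob)
        chunks.append(gzip.decompress(blob[index[i]:end]))
    joined = b"".join(chunks)
    offset = start - first * block_size
    return joined[offset:offset + length]
```

The concatenated output is still a valid gzip stream, so ordinary tools can decompress it from start to end; the index is what buys the random access. A plain single-member gzip file has no such restart points, which is why indexing it is so expensive.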

I do not know how Biopython is structured, and I am not sure it is possible/easy to use bgzip with the EMBL parser. If that is possible, you can find an XZ Python binding and pair it with the EMBL parser. On the other hand, I am not sure how much space lzma may save you. Maybe 30%? My opinion is that for your 100GB file, 30% is not worth the headache. Gzip is faster on decompression, much faster on compression and has wider support.
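Rather than trusting my 30% guess, it may be worth measuring on a sample of your own data. The sketch below uses the standard-library `gzip` and `lzma` modules; the helper names `read_sample` and `compare_sizes`, the sample size, and the compression levels are assumptions for illustration, and a small sample from the start of a file will not necessarily be representative of the whole.

```python
import gzip
import lzma


def read_sample(path: str, n_bytes: int = 50 * 1024 * 1024) -> bytes:
    """Read the first n_bytes of uncompressed data from a .gz file."""
    with gzip.open(path, "rb") as handle:
        return handle.read(n_bytes)


def compare_sizes(sample: bytes) -> dict:
    """Compress the same bytes with gzip and xz and report the sizes."""
    gz = gzip.compress(sample, compresslevel=6)
    xz = lzma.compress(sample, format=lzma.FORMAT_XZ, preset=6)
    return {
        "raw": len(sample),
        "gzip": len(gz),
        "xz": len(xz),
        # Fraction of the gzip size that xz saves (negative if xz is larger).
        "xz_saving_vs_gzip": 1 - len(xz) / len(gz),
    }
```

Something like `compare_sizes(read_sample("genome.embl.gz"))` (the file name is a placeholder) gives a rough idea of whether the extra space saving justifies the slower compression.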

That said, if I were asked to achieve random access in a huge EMBL file, I would convert it to a GFF for features and a FASTA for sequences. I would then compress the GFF with bgzip and index it with tabix, and compress the FASTA with razip from samtools and index it with faidx. When I need to retrieve features, I use the tabix Python binding or pysam, or call tabix to query the GFF; when I need to retrieve sequences, I call samtools faidx (I do not know if pysam supports compressed faidx; if it does, you can use pysam instead of calling samtools). This way is faster than working directly on an EMBL file and does not require you to write a lot of new tools. Nonetheless, if you would like to practice your programming skills, you can write a compressed EMBL indexer, which is going to take quite some time. I frequently waste my time on such projects, but in the long run, it pays off.
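For anyone curious what the sequence index does, here is a rough pure-Python sketch of a faidx-style index. It records, per sequence, the same kind of information a .fai file holds (length, byte offset of the first base, bases per line, bytes per line) and uses it to seek straight to a region. It works on an uncompressed FASTA opened in binary mode and assumes every line of a record except the last has the same length; the function names are hypothetical, and in practice you would call samtools faidx rather than this.

```python
import io


def build_fasta_index(handle):
    """Scan a binary FASTA handle once.

    Returns {name: (length, offset, line_bases, line_width)}.
    """
    index = {}
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position of the start of the current line
    for line in handle:
        if line.startswith(b">"):
            if name is not None:
                index[name] = (length, offset, line_bases, line_width)
            name = line[1:].split()[0].decode()
            length = line_bases = line_width = 0
            offset = pos + len(line)  # first base follows the header line
        elif name is not None and line.strip():
            bases = len(line.rstrip(b"\r\n"))
            if line_bases == 0:
                line_bases, line_width = bases, len(line)
            length += bases
        pos += len(line)
    if name is not None:
        index[name] = (length, offset, line_bases, line_width)
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) of sequence `name` by seeking."""
    length, offset, line_bases, line_width = index[name]
    start, end = max(0, start), min(end, length)
    if start >= end:
        return ""

    def file_pos(i):
        # Byte position of base i: skip whole lines, then move within the line.
        return offset + (i // line_bases) * line_width + i % line_bases

    handle.seek(file_pos(start))
    raw = handle.read(file_pos(end - 1) + 1 - file_pos(start))
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With a block-compressed file, the same arithmetic applies to uncompressed offsets, and the compression layer translates those into compressed blocks.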