Tool:Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing
0
17
Entering edit mode
7.7 years ago

pyfaidx is a Python package for random access and indexing of fasta files. It uses the same indexing scheme as samtools faidx and therefore can generate/use the same "fasta.fai" indices. BioPython has something similar, although to my knowledge it creates an in-memory index of similar structure but does not allow re-using samtools index or writing the new index to a file. FastaHack is a C implementation with similar scope.

The main advantages of pyfaidx are:

  • produces and re-uses the samtools fasta index you already have
  • import it into your Python 2.6, 2.7, 3.2, 3.3, or pypy script without all of BioPython
  • it's entirely Python so you can easily hack and extend it
  • the faidx cli script is actually faster than samtools faidx (see below)

The indexing time for hg19 (~3.0GB) is comparable to samtools:

samtools:

real    0m38.077s   user    0m35.326s   sys    0m2.340s
real    0m40.551s   user    0m37.137s   sys    0m2.490s
real    0m39.788s   user    0m36.858s   sys    0m2.454s

pyfaidx:

real    0m46.735s   user    0m43.537s   sys    0m2.462s
real    0m54.161s   user    0m49.715s   sys    0m2.867s
real    0m52.432s   user    0m49.056s   sys    0m2.703s

Once the index is generated, samtools and pyfaidx access speeds somewhat surprising:

faidx (pyfaidx cli script):

time faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real    0m1.586s    user    0m1.095s    sys    0m0.460s 
real    0m1.548s    user    0m1.079s    sys    0m0.446s
real    0m1.505s    user    0m1.061s    sys    0m0.434s

samtools:

time samtools faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real    0m8.356s    user    0m8.190s    sys    0m0.128s
real    0m8.389s    user    0m8.205s    sys    0m0.149s
real    0m8.236s    user    0m8.094s    sys    0m0.126s

This is a 5.5X increase in subsequence retrieval speed! Thanks go out to @brentp for his contributions.

python fasta samtools Tool • 8.3k views
ADD COMMENT
2
Entering edit mode

Pretty neat!

I will mention here that for a FASTA file to work with this method (and this applies to samtools faidx) the fasta files need to be wrapped with the same column length. This condition is usually satisfied but when it is not all kinds of subtle errors could occur.

ADD REPLY
0
Entering edit mode

Yes, and I've been meaning to add this to the documentation as well as add a check during indexing. Thanks for the reminder!

ADD REPLY
0
Entering edit mode

Great stuff. Does pyfaidx also support gzipped fasta files?

ADD REPLY
0
Entering edit mode

Not currently, and probably not ever, but I do think BioPython SeqIO might do something with bgzipped files.

ADD REPLY
0
Entering edit mode

maybe it could be made to work with an razip'ed file. samtools can index and razip'ed fasta, right?

ADD REPLY
0
Entering edit mode

Yes, that is true, and I think this would be feasible but will require me to write a python RAZip implementation, which I would like to do but might not have the time.

ADD REPLY
0
Entering edit mode

Very useful tool with an impressive performance!

ADD REPLY
0
Entering edit mode

I'm extracting from tens to hundreds of millions of regions from a fasta file, and as samtools faidx got too slow, I decided to give pyfaidx a chance. However, it choke completely or alternatively was even slower than samtools faidx (and I got tired of waiting and killed it). Neatly, fastaFromBed from bedtools2 turned out to be about 4x faster than samtools faidx..

ADD REPLY
0
Entering edit mode

Can I ask how you were using pyfaidx? It's possible your performance was due to something other than fetching the sequence...

ADD REPLY
0
Entering edit mode

Like this:

xargs faidx file.fa < regions.id > out
ADD REPLY
0
Entering edit mode

This is not testing the performance of pyfaidx, but rather the performance of loading the Python interpreter and faidx program into memory. A better solution would be to pass the regions as a BED file:

$ wc -l < hg38.trf.bed 
432604
$ time faidx --bed hg38.trf.bed hg38.fa > /dev/null

real    0m41.210s
user    0m19.357s
sys 0m4.124s

$ time xargs samtools faidx hg38.fa < hg38.trf.regions > /dev/null

real    0m40.691s
user    0m3.025s
sys 0m7.670s

$ time bedtools getfasta -bed hg38.trf.bed -fi hg38.fa -fo /dev/null

real    0m4.732s
user    0m1.906s
sys 0m0.710s

Of course bedtools getfasta is always going to be the fastest for large numbers of regions, since it reads the entire FASTA file into memory and then subsets from in-memory arrays rather than seeking to file offsets - access times here are orders of magnitude different.

ADD REPLY

Login before adding your answer.

Traffic: 2981 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6