Tool: Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing
14
gravatar for Matt Shirley
6.1 years ago by
Matt Shirley9.3k
Cambridge, MA
Matt Shirley9.3k wrote:

pyfaidx is a Python package for random access and indexing of fasta files. It uses the same indexing scheme as samtools faidx and therefore can generate/use the same "fasta.fai" indices. BioPython has something similar, although to my knowledge it creates an in-memory index of similar structure but does not allow re-using samtools index or writing the new index to a file. FastaHack is a C implementation with similar scope.

The main advantages of pyfaidx are:

  • produces and re-uses the samtools fasta index you already have
  • import it into your Python 2.6, 2.7, 3.2, 3.3, or pypy script without all of BioPython
  • it's entirely Python so you can easily hack and extend it
  • the faidx cli script is actually faster than samtools faidx (see below)

The indexing time for hg19 (~3.0GB) is comparable to samtools:

samtools:

real    0m38.077s   user    0m35.326s   sys    0m2.340s
real    0m40.551s   user    0m37.137s   sys    0m2.490s
real    0m39.788s   user    0m36.858s   sys    0m2.454s

pyfaidx:

real    0m46.735s   user    0m43.537s   sys    0m2.462s
real    0m54.161s   user    0m49.715s   sys    0m2.867s
real    0m52.432s   user    0m49.056s   sys    0m2.703s

Once the index is generated, samtools and pyfaidx access speeds somewhat surprising:

faidx (pyfaidx cli script):

time faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real    0m1.586s    user    0m1.095s    sys    0m0.460s 
real    0m1.548s    user    0m1.079s    sys    0m0.446s
real    0m1.505s    user    0m1.061s    sys    0m0.434s

samtools:

time samtools faidx hs_ref_GRCh37_multi.fa 'gi|224589801|ref|NC_000010.10|:1-100000000' > /dev/null
real    0m8.356s    user    0m8.190s    sys    0m0.128s
real    0m8.389s    user    0m8.205s    sys    0m0.149s
real    0m8.236s    user    0m8.094s    sys    0m0.126s

This is a 5.5X increase in subsequence retrieval speed! Thanks go out to @brentp for his contributions.

fasta python samtools tool • 6.9k views
ADD COMMENTlink modified 3.8 years ago by newdesign32100 • written 6.1 years ago by Matt Shirley9.3k
2

Pretty neat!

I will mention here that for a FASTA file to work with this method (and this applies to samtools faidx) the fasta files need to be wrapped with the same column length. This condition is usually satisfied but when it is not all kinds of subtle errors could occur.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Istvan Albert ♦♦ 83k

Yes, and I've been meaning to add this to the documentation as well as add a check during indexing. Thanks for the reminder!

ADD REPLYlink written 6.1 years ago by Matt Shirley9.3k

Great stuff. Does pyfaidx also support gzipped fasta files?

ADD REPLYlink written 6.1 years ago by Andreas2.4k

Not currently, and probably not ever, but I do think BioPython SeqIO might do something with bgzipped files.

ADD REPLYlink written 6.1 years ago by Matt Shirley9.3k

maybe it could be made to work with an razip'ed file. samtools can index and razip'ed fasta, right?

ADD REPLYlink written 6.1 years ago by brentp23k

Yes, that is true, and I think this would be feasible but will require me to write a python RAZip implementation, which I would like to do but might not have the time.

ADD REPLYlink written 6.1 years ago by Matt Shirley9.3k

Very useful tool with an impressive performance!

ADD REPLYlink written 5.6 years ago by jgreit0

I'm extracting from tens to hundreds of millions of regions from a fasta file, and as samtools faidx got too slow, I decided to give pyfaidx a chance. However, it choke completely or alternatively was even slower than samtools faidx (and I got tired of waiting and killed it). Neatly, fastaFromBed from bedtools2 turned out to be about 4x faster than samtools faidx..

ADD REPLYlink written 3.9 years ago by 5heikki8.7k

Can I ask how you were using pyfaidx? It's possible your performance was due to something other than fetching the sequence...

ADD REPLYlink written 3.8 years ago by Matt Shirley9.3k

Like this:

xargs faidx file.fa < regions.id > out
ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by 5heikki8.7k

This is not testing the performance of pyfaidx, but rather the performance of loading the Python interpreter and faidx program into memory. A better solution would be to pass the regions as a BED file:

$ wc -l < hg38.trf.bed 
432604
$ time faidx --bed hg38.trf.bed hg38.fa > /dev/null

real    0m41.210s
user    0m19.357s
sys 0m4.124s

$ time xargs samtools faidx hg38.fa < hg38.trf.regions > /dev/null

real    0m40.691s
user    0m3.025s
sys 0m7.670s

$ time bedtools getfasta -bed hg38.trf.bed -fi hg38.fa -fo /dev/null

real    0m4.732s
user    0m1.906s
sys 0m0.710s

Of course bedtools getfasta is always going to be the fastest for large numbers of regions, since it reads the entire FASTA file into memory and then subsets from in-memory arrays rather than seeking to file offsets - access times here are orders of magnitude different.

ADD REPLYlink written 3.8 years ago by Matt Shirley9.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2118 users visited in the last hour