Question

Sub-Record Indexing Of Fasta Files In Python?

10.7 years ago

interfect ▴ 10

Hello,

I have several genomes, each spread across multiple FASTA files. Some of these files contain multiple short sequences, and some of them contain large, chromosome-sized sequences. I need efficient random access to sequence data by FASTA record ID and sequence location (e.g. chr1:12384-12884) from Python.

I am currently using BioPython to read all of the FASTA files for a genome and store the SeqRecords in a dict keyed by FASTA ID (with BioPython's to_dict). This gives me efficient random access (get the SeqRecord by FASTA ID, then slice out the part I need), but it requires the whole genome to be loaded into memory. I would like to replace this with on-disk indexing.

BioPython provides on-disk indexing of FASTA files, but it appears to be at a per-record level. I can't ask BioPython to read a small subset of a FASTA record from an indexed FASTA file; I have to get the whole record (which may be many megabytes) from BioPython and then take the part I want and throw out the rest.

Is there a Python library that can index within FASTA records and pull out pieces without having to read the whole record each time? Does BioPython support this use case in a way that I am not aware of (with some sort of lazy-loading SeqRecord or something)?

biopython fasta • 3.6k views

score 1 · Answer 1 · 2013-07-30

This general functionality doesn't exist in Biopython yet, but it was the basis of a Google Summer of Code 2013 project idea: http://biopython.org/wiki/Google_Summer_of_Code

You could try using a BioSQL database from Biopython, for example by loading your FASTA file into an SQLite database: http://biopython.org/wiki/BioSQL

You could try using the pysam interface to faidx, Heng Li's indexing system for FASTA files.
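To make the faidx idea concrete, here is a minimal standard-library sketch of the technique, not pysam's or samtools' actual code. It records, for each record, the sequence length, the byte offset of the first base, the bases per line and the bytes per line (the same quantities a .fai file stores), then seeks straight to the requested bases. The function names `build_index` and `fetch` are made up for illustration, and the sketch assumes every sequence line in a record except the last has the same width.

```python
import os
import tempfile


def build_index(fasta_path):
    """Scan the file once; return {name: (length, offset, line_bases, line_width)}."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                # The first base of this record starts right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = line.rstrip(b"\r\n")
                if entry[2] == 0:
                    entry[2] = len(bases)  # bases per full line
                    entry[3] = len(line)   # bytes per full line, newline included
                entry[0] += len(bases)
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) (0-based, half-open) without reading the whole record."""
    length, offset, line_bases, line_width = index[name]
    start = max(0, start)
    end = min(end, length)
    if start >= end:
        return ""
    last_base = end - 1
    # Byte position = record offset + full lines skipped * bytes per line + column.
    first_byte = offset + (start // line_bases) * line_width + start % line_bases
    last_byte = offset + (last_base // line_bases) * line_width + last_base % line_bases
    with open(fasta_path, "rb") as handle:
        handle.seek(first_byte)
        chunk = handle.read(last_byte - first_byte + 1)
    # The chunk still contains the line breaks between sequence lines.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Small demonstration on a throwaway two-record file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa")
with open(demo_path, "w", newline="\n") as out:
    out.write(">chr1 test\nACGTACGTAC\nGGGGCCCCAA\nTT\n>chr2\nAAAA\n")
demo_index = build_index(demo_path)
print(fetch(demo_path, demo_index, "chr1", 8, 13))
```

With 0-based, half-open coordinates, a region written as chr1:12384-12884 (1-based, inclusive) would correspond to `fetch(path, index, "chr1", 12383, 12884)`.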

score 1 · Answer 2 · 2013-07-30

10.7 years ago

Matt Shirley 10k

This sort of echoes Peter's answer, but you might check out brentp's fastahack-python project. It wraps the C FastaHack, which allows random access to faidx-indexed FASTA files. I use it heavily in a current project, and it's stable and fast.

The fastahack module seems to be working great for me. It has a dependency on cython that isn't listed in its setup.py, but once I figured that out I was able to get it working very quickly. Thanks!

10.7 years ago by interfect ▴ 10

score 0 · Answer 3 · 2013-07-30

10.7 years ago

Asaf 10k

The linecache module might be useful. If the sequence lines have a fixed width, you can easily calculate which lines you need to read to get the subsequence you want, and then retrieve exactly those lines. You could write a small wrapper that does this automatically (you might even contribute it to Biopython).
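A small wrapper along these lines might look like the sketch below. The helper names `find_header_line` and `subsequence` are hypothetical, and the code assumes that every sequence line of a record except the last has the same number of bases and that the requested range lies inside the record. Note that linecache keeps a file's lines cached in memory after the first access, so this may not reduce memory use for chromosome-sized files.

```python
import linecache
import os
import tempfile


def find_header_line(fasta_path, record_id):
    """Return the 1-based line number of the '>' header for record_id."""
    with open(fasta_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith(">") and line[1:].split()[:1] == [record_id]:
                return line_number
    raise KeyError(record_id)


def subsequence(fasta_path, header_line, start, end, line_bases):
    """Return bases [start, end) (0-based, half-open) of the record at header_line."""
    if start >= end:
        return ""
    # Sequence lines begin on the line after the header.
    first_line = header_line + 1 + start // line_bases
    last_line = header_line + 1 + (end - 1) // line_bases
    pieces = [
        linecache.getline(fasta_path, number).rstrip("\r\n")
        for number in range(first_line, last_line + 1)
    ]
    joined = "".join(pieces)
    # Drop the bases on the first fetched line that come before `start`.
    skip = start % line_bases
    return joined[skip:skip + (end - start)]


# Small demonstration on a throwaway file with 10 bases per line.
lc_demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa")
with open(lc_demo_path, "w", newline="\n") as out:
    out.write(">chr1 test\nACGTACGTAC\nGGGGCCCCAA\nTT\n>chr2\nAAAA\n")
chr1_header = find_header_line(lc_demo_path, "chr1")
print(subsequence(lc_demo_path, chr1_header, 8, 13, line_bases=10))
```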
