Sub-Record Indexing Of Fasta Files In Python?
10.7 years ago
interfect ▴ 10

Hello,

I have several genomes, each spread across multiple FASTA files. Some of these files contain multiple short sequences, and some of them contain large, chromosome-sized sequences. I need efficient random access to sequence data by FASTA record ID and sequence location (e.g. chr1:12384-12884) from Python.

I am currently using BioPython to read all of the FASTA files for a genome and store the SeqRecords in a dict keyed by FASTA ID (with BioPython's to_dict). This gives me efficient random access (get the SeqRecord by FASTA ID, and then subset it to the part I need), but it requires the whole genome to be loaded into memory. I would like to replace this with on-disk indexing.

BioPython provides on-disk indexing of FASTA files, but it appears to be at a per-record level. I can't ask BioPython to read a small subset of a FASTA record from an indexed FASTA file; I have to get the whole record (which may be many megabytes) from BioPython and then take the part I want and throw out the rest.

Is there a Python library that can index within FASTA records and pull out pieces without having to read the whole record each time? Does BioPython support this use case in a way that I am not aware of (with some sort of lazy-loading SeqRecord or something)?

biopython fasta • 3.6k views
10.7 years ago
Peter 6.0k

This general functionality doesn't exist in Biopython yet, but it was the basis of a Google Summer of Code 2013 project idea: http://biopython.org/wiki/Google_Summer_of_Code

You could try using a BioSQL database from Biopython, for example by loading your FASTA files into an SQLite database: http://biopython.org/wiki/BioSQL

You could try using the pysam interface to faidx, Heng Li's indexing system for FASTA files.
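To make the faidx idea concrete, here is a minimal pure-Python sketch of how such an index can work. It is not pysam's or samtools' code: the names `FaiEntry`, `build_index` and `fetch` are made up for illustration, and it assumes every sequence line within a record (except the last) has the same length, which faidx also requires. For each record it stores the sequence length, the byte offset of the first base, and the line layout, so that a region can be read with a single seek.

```python
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    """Hypothetical per-record index entry, modelled on the .fai columns."""
    length: int      # number of bases in the record
    offset: int      # byte offset of the record's first base
    line_bases: int  # bases on each full sequence line
    line_width: int  # bytes on each full sequence line, newline included


def build_index(path: str) -> Dict[str, FaiEntry]:
    """Scan the FASTA file once and note where each record's sequence starts."""
    index: Dict[str, FaiEntry] = {}
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position of the start of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = FaiEntry(length, offset, line_bases, line_width)
                # The record ID is the first whitespace-delimited token.
                name = raw[1:].split()[0].decode("ascii")
                length = line_bases = line_width = 0
                offset = pos + len(raw)
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_width = bases, len(raw)
                length += bases
            pos += len(raw)
    if name is not None:
        index[name] = FaiEntry(length, offset, line_bases, line_width)
    return index


def fetch(path: str, index: Dict[str, FaiEntry], name: str, start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive), e.g. chr1:12384-12884."""
    entry = index[name]
    if not 1 <= start <= end <= entry.length:
        raise ValueError("region lies outside the record")

    def byte_pos(i: int) -> int:
        # i is a 0-based base index; whole lines before it cost line_width bytes each.
        return entry.offset + (i // entry.line_bases) * entry.line_width + i % entry.line_bases

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # Only the requested span is read; drop the newlines embedded in it.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

With pysam itself, the equivalent lookup should be roughly `pysam.FastaFile("genome.fa").fetch(region="chr1:12384-12884")`, where `genome.fa` is a placeholder path; check the pysam documentation for the coordinate conventions of the `start`/`end` arguments before relying on it.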

10.7 years ago

This sort of echoes Peter's answer, but you might check out brentp's fastahack-python project. It wraps the C FastaHack, which allows random access to faidx-indexed FASTA files. I use it heavily in a current project, and it's stable and fast.


The fastahack module seems to be working great for me. It has a dependency on Cython that isn't listed in its setup.py, but once I figured that out I was able to get it working very quickly. Thanks!

10.7 years ago
Asaf 10k

The linecache module might be useful: you can easily calculate which lines you need to read to get the subsequence you want, and then retrieve exactly that sequence. You could write a small wrapper that does this automatically (you might even contribute it to Biopython).
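A rough sketch of that wrapper might look like the following. The function name `fetch_by_lines` and its parameters are made up for illustration, and it assumes you already know the line number of the record's `>` header and that every sequence line in the record (except possibly the last) holds the same number of bases. One caveat: linecache reads and caches the whole file the first time a line is requested, so this saves parsing work rather than memory.

```python
import linecache


def fetch_by_lines(path, header_line, line_bases, start, end):
    """Return bases start..end (1-based, inclusive) of one FASTA record.

    header_line is the 1-based line number of the record's '>' line and
    line_bases is the number of bases on each full sequence line. Both are
    assumed to be known already, e.g. from one earlier pass over the file.
    No check is made that `end` stays inside the record.
    """
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    # Sequence lines begin right after the header line.
    first_line = header_line + 1 + (start - 1) // line_bases
    last_line = header_line + 1 + (end - 1) // line_bases
    lines = [linecache.getline(path, n).strip()
             for n in range(first_line, last_line + 1)]
    joined = "".join(lines)
    # Drop the bases on the first line that precede `start`.
    skip = (start - 1) % line_bases
    return joined[skip:skip + (end - start + 1)]
```

For a record whose header is on line 1 of a file wrapped at 60 bases per line, `fetch_by_lines(path, 1, 60, 12384, 12884)` would correspond to the region 12384-12884 of that record.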
