Question: Sub-Record Indexing Of Fasta Files In Python?
interfect (United States) wrote 5.6 years ago:

Hello,

I have several genomes, each spread across multiple FASTA files. Some of these files contain multiple short sequences, and some of them contain large, chromosome-sized sequences. I need efficient random access to sequence data by FASTA record ID and sequence location (e.g. chr1:12384-12884) from Python.
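For readers following along, a region string like the one above can be turned into a Python slice with a few lines of code. This is a minimal sketch: the names `Region` and `parse_region` are made up for illustration, and it assumes the samtools-style convention that `chr1:12384-12884` is 1-based and inclusive at both ends, which the question does not actually state.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


# Greedy name match, so record IDs that themselves contain ':' still parse.
_REGION_RE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'name:start-end', ASSUMING 1-based inclusive coordinates."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    start = int(match["start"])
    end = int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {text!r}")
    # Convert to a Python-style half-open interval.
    return Region(match["name"], start - 1, end)
```

With that conversion, `sequence[region.start:region.end]` gives the requested bases regardless of which backend supplies `sequence`.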

I am currently using BioPython to read all of the FASTA files for a genome and store the SeqRecords in a dict by FASTA id (with BioPython's SeqIO.to_dict). This gives me efficient random access (get the SeqRecord by FASTA id, and then subset it to the part I need), but it requires the whole genome to be loaded into memory. I would like to replace this with on-disk indexing.

BioPython provides on-disk indexing of FASTA files, but it appears to be at a per-record level. I can't ask BioPython to read a small subset of a FASTA record from an indexed FASTA file; I have to get the whole record (which may be many megabytes) from BioPython and then take the part I want and throw out the rest.

Is there a Python library that can index within FASTA records and pull out pieces without having to read the whole record each time? Does BioPython support this use case in a way that I am not aware of (with some sort of lazy-loading SeqRecord or something)?

Tags: fasta, biopython • 2.2k views
Modified 5.6 years ago by Matt Shirley; written 5.6 years ago by interfect
Peter (Scotland, UK) wrote 5.6 years ago:

This general functionality doesn't exist in Biopython yet, but it was the basis of a Google Summer of Code 2013 project idea: http://biopython.org/wiki/Google_Summer_of_Code

You could try using a BioSQL database from Biopython, for example by loading your FASTA file into an SQLite database: http://biopython.org/wiki/BioSQL

You could try using the pysam interface to faidx, Heng Li's indexing system for FASTA files.
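To show why a faidx-style index allows sub-record access, here is a rough pure-Python sketch of the underlying arithmetic. It is not pysam's or samtools' code; the names `FaiEntry`, `build_index`, and `fetch` are invented for this example. It assumes a well-formed FASTA file in which every sequence line of a record, except possibly the last, has the same length, which is the same requirement faidx imposes.

```python
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    length: int      # number of bases in the record
    offset: int      # byte offset of the record's first base
    line_bases: int  # bases per full sequence line
    line_width: int  # bytes per full sequence line, newline included


def build_index(path: str) -> Dict[str, FaiEntry]:
    """Scan the file once and record faidx-style offsets for each record."""
    index: Dict[str, FaiEntry] = {}
    name = None
    length = offset = line_bases = line_width = 0
    position = 0  # byte offset just past the line being processed
    with open(path, "rb") as handle:
        for raw in handle:
            position += len(raw)
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = FaiEntry(length, offset, line_bases, line_width)
                # Record ID is the first whitespace-delimited word of the header.
                name = raw[1:].split()[0].decode("ascii")
                length, offset = 0, position
                line_bases = line_width = 0
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0 and bases:
                    line_bases, line_width = bases, len(raw)
                length += bases
    if name is not None:
        index[name] = FaiEntry(length, offset, line_bases, line_width)
    return index


def fetch(path: str, index: Dict[str, FaiEntry], name: str, start: int, end: int) -> str:
    """Return bases [start, end) of a record (0-based, half-open)."""
    entry = index[name]
    start = max(start, 0)
    end = min(end, entry.length)
    if start >= end:
        return ""

    def byte_pos(i: int) -> int:
        # Whole lines before base i, plus the position within its line.
        return entry.offset + (i // entry.line_bases) * entry.line_width + i % entry.line_bases

    first = byte_pos(start)
    last = byte_pos(end - 1) + 1
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # Only the requested span was read; drop the embedded line breaks.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

The four fields mirror the LENGTH, OFFSET, LINEBASES, and LINEWIDTH columns of a samtools `.fai` file, so in practice you would let samtools or pysam build and read the index rather than rescanning the FASTA yourself.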

Matt Shirley (Cambridge, MA) wrote 5.6 years ago:

This sort of echoes Peter's answer, but you might check out brentp's fastahack-python project. It wraps the C++ fastahack library, which allows random access to faidx-indexed FASTA files. I use it heavily in a current project and it's stable and fast.


The fastahack module seems to be working great for me. It has a dependency on Cython that isn't listed in its setup.py, but once I figured that out I was able to get it working very quickly. Thanks!

(Reply written 5.6 years ago by interfect)
Asaf (Israel) wrote 5.6 years ago:

The linecache module might be useful: if the sequence lines have a fixed width, you can calculate which lines you need to read to get the subsequence you want and then retrieve exactly that sequence. You could write a small wrapper that does this automatically (you might even contribute it to Biopython).
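A small wrapper along those lines might look like the sketch below. The function name `subsequence` and its parameters are hypothetical. It assumes you already know the 1-based line number of the record's header and that every sequence line except the last holds exactly `line_bases` bases. One caveat: `linecache` reads and caches all lines of the file on first access, so this saves parsing work but does not keep the file out of memory.

```python
import linecache


def subsequence(path: str, header_line: int, line_bases: int, start: int, end: int) -> str:
    """Return bases [start, end) (0-based, half-open) of the record whose
    header sits on 1-based line `header_line`.

    ASSUMES fixed-width sequence lines; does no bounds checking, so an
    `end` past the record would run into the next record's lines.
    """
    if start < 0 or end <= start:
        return ""
    # Sequence lines begin on the line right after the header.
    first_line = header_line + 1 + start // line_bases
    last_line = header_line + 1 + (end - 1) // line_bases
    pieces = [
        linecache.getline(path, n).rstrip("\r\n")
        for n in range(first_line, last_line + 1)
    ]
    joined = "".join(pieces)
    skip = start % line_bases  # bases to drop from the first fetched line
    return joined[skip:skip + (end - start)]
```

For truly on-disk access to chromosome-sized records, a byte-offset index of the kind faidx uses is likely the better fit; the line arithmetic here is the same idea expressed in lines instead of bytes.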
