What is the most efficient way to parse a single fasta file which contains over 1000 fasta sequences in python ?
2
0
Entering edit mode
5.7 years ago

I would like to store each fasta sequence as a string in a format which will allow me to parse the strings with a function that calculates a measure for the entire sequence (ex a function which calculates average protein weight). I would than like to plot some of the data ?

genome python • 2.1k views
ADD COMMENT
1
Entering edit mode
5.7 years ago

One efficient way to parse:

#!/usr/bin/env python

import sys
from collections import OrderedDict

header = ""
sequence = ""
records = OrderedDict()

for line in sys.stdin:
  line = line.rstrip()
  if line.startswith('>'):
    if len(sequence) > 0:
      records[header] = sequence
    header = line
    sequence = ""
  else:
    sequence += line
if len(sequence) > 0:
  records[header] = sequence


#
# do stuff with records, as you would any other dictionary, like 
# iterate over records.keys() to work on each sequence at a time, etc.
#

To use, e.g.:

$ ./parse_fasta.py < records.fa

There are larger libraries for parsing FASTA, but they tend to run slowly and use a lot of memory. YMMV.

ADD COMMENT
0
Entering edit mode

I'd do this with a generator, if the sequences do not have to be kept, but could be processed on the fly.

Using Alex's code as example that would stuff the whole block into a function, e.g. parseFasta() and then replace the records[header] = sequence with a yield sequence. The whole thing could then be used like this:

for seq in readFasta():
    do_something_with_seq(seq)

Edit: You can get that very same functionality implemented in https://github.com/cschu/ktio. Or, I think pip install ktio and the n from ktio.ktio import readFasta should also work. (Sorry, badly documented!)

ADD REPLY
0
Entering edit mode
5.7 years ago

Thanks guys - this is my first time using biostars and this is really helpful. To speed this up should I look into multithreading etc if I have access to a cloud server with 24 CPUs ?

ADD COMMENT
0
Entering edit mode

Threads won't help. This is an IO-limited procedure. Computational power is not needed for reading in data.

ADD REPLY

Login before adding your answer.

Traffic: 2689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6