Correct Way To Parse A Fasta File In Python
8
27
Entering edit mode
13.0 years ago

Hi,

I have been wondering at the correct approach in Python, maybe using Biopython, of parsing a fasta file without having to place it in memory (eg: NOT having to read it to a list, dictionary or fasta class) before using it.

The desired result would behave like a generator, as in the pseudo-code example below:

fasta_sequences = fasta_generator(input_file) # The function I miss
with open(output_file) as out_file:
for fasta in fasta_sequences:
name, sequence = fasta
new_sequence = some_function(sequence)
write_fasta(out_file) # Function defined elsewhere


Important aspects are:

• Read sequences one at a time
• Does not put all the sequences into memory
• The approach is safe and well tested

fasta parsing python • 175k views
46
Entering edit mode
13.0 years ago
Zhaorong ★ 1.4k

I think you can just use Biopython

from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
with open(output_file) as out_file:
for fasta in fasta_sequences:
name, sequence = fasta.id, str(fasta.seq)
new_sequence = some_function(sequence)
write_fasta(out_file)


You can take a look at the Biopython tutorial: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

3
Entering edit mode

I use Biopython all the time, but parsing fasta files is all I ever use it for. :) Two other functions I use for fasta parsing is: SeqIO.to_dict() which builds all sequences into a dictionary and save it in memory SeqIO.index() which builds a dictionary without putting the sequences in memory

I was thinking of looking into Biopython a little deeper, since it offers much more than fasta parsing, but did not get a chance. :(

2
Entering edit mode

Very useful answer from 7 years ago! FYI, in current version of biopython(1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead.

0
Entering edit mode

Very nice example. I knew Biopython offered something of the sort, but had never tried it. @Zhaorong do you have a lot of experience with Biopython? What have you used it for? Cheers

0
Entering edit mode

fasta.seq.tostring() should have been str(fasta.seq). This example is wrong.

3
Entering edit mode

The example is now wrong, but almost 9 years ago when this was written it was perfectly valid. The Bio.Seq.Seq.tostring() method was removed in the latest Biopython release, and was deprecated a bit before: https://github.com/biopython/biopython/blob/654309121f2cc0c01dfff73cd3dec3a435d76fc2/DEPRECATED.rst#bioseqseqtostring-and-bioseqmutableseqtostring

2
Entering edit mode

It is indeed wrong today. I edited the answer since it has been possible to use str(sequence) for a long time now.

1
Entering edit mode

Thanks. I constantly forget that I'm a moderator.

19
Entering edit mode
12.8 years ago
brentp 24k

from the implementation in the link above, i came up with this. it's pretty concise and only consumes 1 record at a time.

from itertools import groupby

def fasta_iter(fasta_name):
"""
given a fasta file. yield tuples of header, sequence
"""
fh = open(fasta_name)
# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
# drop the ">"
# join all sequence lines to one.
seq = "".join(s.strip() for s in faiter.next())

1
Entering edit mode

I like your use of groupby and lambda

1
Entering edit mode

great solution. How can I make this work in Python3?

1
Entering edit mode
" for python3"
from itertools import groupby

def fasta_iter(fasta_name):
"""
modified from Brent Pedersen
Correct Way To Parse A Fasta File In Python
given a fasta file. yield tuples of header, sequence
"""
"first open the file outside "
fh = open(fasta_name)

# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))

# drop the ">"

# join all sequence lines to one.
seq = "".join(s.strip() for s in faiter.__next__())

fiter = fasta_iter('testset.fas')

for ff in fiter:

0
Entering edit mode

Thanks for the code. I'd like to supplement it for opeing file in binary mode.

fin = open(filename, 'rb')

faiter = (x[1] for x in groupby(fin, lambda line: str(line, 'utf-8')[0] == ">"))
name = long_name.split()[0]
seq = "".join(str(s, 'utf-8').strip() for s in faiter.__next__())
yield (name, seq, long_name)

0
Entering edit mode

Nicely done, thanks for a great example on how to use groupby! I'll try to include it in my code, although I'll probably keep the Biopython fasta parser for now :)

0
Entering edit mode

Awesome approach, helped me a lot.

0
Entering edit mode

When I print "header" I get only the first header whereas I have many header. How do I get a list of all the headers or seqs?

0
Entering edit mode

That's because this is a generator function. Do get all headers you'll have to do something like:

fasta = fasta_iter("hg19.fa")


Or, you can use a package such as https://github.com/mdshw5/pyfaidx and do:

from pyfaidx import Fasta
fasta = Fasta("hg19.fa")

4
Entering edit mode
13.0 years ago
brentp 24k

in another thread, someone pointed you to this: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/

which i think is pretty beautiful. but this is trivial in any library.

4
Entering edit mode
12.8 years ago
Science_Robot ★ 1.1k

Here's one I used for a script. I don't know if it's necessarily the best. I've made dozens of different variations.

You do fasta = Fasta(handle) and then for record in fasta to yield Dna objects

class Dna:
''' Object representing a FASTA record. '''
self.seq = sequence
def __repr__(self):
def __str__(self, separator=''):
def __len__(self):
return len(''.join(self.seq))
@property
def sequence(self, separator=''):
return separator.join(self.seq)

class Fasta:
''' A FASTA iterator/generates DNA objects. '''
def __init__(self, handle):
self.handle = handle
def __repr__(self):
return '[HTML]' % handle
def __iter__(self):
for line in self.handle:
if line[0] == '>':
sequence = []
else:
sequence.append(line.strip())

4
Entering edit mode
8.7 years ago

You might check out Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. This is a package that I wrote that builds or reads a samtools compatible fasta index and can read arbitrary amounts of the fasta file at once. I've kept it up to date and am always trying to make it more lightweight and "pythonic". Even though it's designed for random access there is an efficient line-based iterator method:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

2
Entering edit mode
13.0 years ago
Paulo Nuin ★ 3.7k

A very simple and not really elegant approach I have on Beginning Python for Bioinformatics website:

class FastaSeq:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence

def get_seqs(file):
items = []
index = 0
for line in file:
if line.startswith(">"):
if index >= 1:
items.append(aninstance)
index+=1
name = line[:-1]
seq = ''
aninstance = FastaSeq(name, seq)
else:
seq += line[:-1]
aninstance = FastaSeq(name, seq)
items.append(aninstance)

return items


It works as a learning experience. But you don't want to reinvent the wheel on this one.

3
Entering edit mode

Any approach that builds a full list is unfeasible in the world of high throughput sequencing. In general any record based reader should be a generator that can simply be unwound into a list if one so desires.

1
Entering edit mode

Strings are immutable, so python has to create a new string object every time you append. It's better to use seq = [], seq.append(line[:-1]), and then ''.join(seq)

0
Entering edit mode

Thanks @nuin. However, this is exactly what I specified I wanted to avoid, for the reasons mentioned by @Istvan Albert.

0
Entering edit mode

Then just change the return statement to a yield. The code needs to be modified to handle the first sequence.

0
Entering edit mode

I guess you meant: change the items.append(aninstance) to yield aninstance

0
Entering edit mode

yes, that's right. The return statement shouldn't exist then.

0
Entering edit mode
7.7 years ago

I regularly use this one... Its much useful when parsing large FASTA file with multiple entries.

def fasta_reader(filename):
from Bio.SeqIO.FastaIO import FastaIterator
with open(filename) as handle:
for record in FastaIterator(handle):
yield record

print strentry.id) #This is header of fasta entry
print str(entry.seq) #This is sequence of specific fasta entry

0
Entering edit mode
18 months ago
yampolsk ▴ 10

Because I teach biology students with no previous coding experience here is a code to read fasta that requires no understanding of somewhat more advanced things. It is by no means efficient for large files, but works for teaching purposes.

0
Entering edit mode
deflines=[]
sequences=[]
f = open("test.fasta", "r")
morelines = True
while morelines:
seq=""
moreseq = True
while moreseq:
if not nxtline:
morelines = False
break
elif nxtline[0] != ">":
seq+=nxtline.strip()
else:
deflines.append(nxtline.strip())
moreseq = False
sequences.append(seq)
for seq in sequences:
print(seq)
for d in deflines:
print(d)
f.close()

0
Entering edit mode

oops I don't understand why all end of lines and indents disappeared. If someone knows how to edit this to look like a code please do...