Question: Correct Way To Parse A Fasta File In Python
17
gravatar for Eric Normandeau
8.5 years ago by
Eric Normandeau9.9k
Quebec, Canada
Eric Normandeau9.9k wrote:

Hi,

I have been wondering at the correct approach in Python, maybe using Biopython, of parsing a fasta file without having to place it in memory (eg: NOT having to read it to a list, dictionary or fasta class) before using it.

The desired result would behave like a generator, as in the pseudo-code example below:

fasta_sequences = fasta_generator(input_file) # The function I miss
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta
        new_sequence = some_function(sequence)
        write_fasta(out_file) # Function defined elsewhere

Important aspects are:

  • Read sequences one at a time
  • Does not put all the sequences into memory
  • The approach is safe and well tested

Thanks for your suggestions!

parsing python fasta • 70k views
ADD COMMENTlink modified 18 days ago by RamRS17k • written 8.5 years ago by Eric Normandeau9.9k
37
gravatar for Zhaorong
8.5 years ago by
Zhaorong1.2k
State College, PA
Zhaorong1.2k wrote:

I think you can just use Biopython

from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
with open(output_file) as out_file:
    for fasta in fasta_sequences:
        name, sequence = fasta.id, fasta.seq.tostring()
        new_sequence = some_function(sequence)
        write_fasta(out_file)

You can take a look at the Biopython tutorial: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

ADD COMMENTlink modified 18 days ago by RamRS17k • written 8.5 years ago by Zhaorong1.2k
2

Very useful answer from 7 years ago! FYI, in current version of biopython(1.69), fasta.seq.tostring() is obsolete, use str(fasta.seq) instead.

ADD REPLYlink modified 17 months ago • written 17 months ago by Chen700
1

I use Biopython all the time, but parsing fasta files is all I ever use it for. :) Two other functions I use for fasta parsing is: SeqIO.to_dict() which builds all sequences into a dictionary and save it in memory SeqIO.index() which builds a dictionary without putting the sequences in memory

I was thinking of looking into Biopython a little deeper, since it offers much more than fasta parsing, but did not get a chance. :(

ADD REPLYlink written 8.5 years ago by Zhaorong1.2k

Very nice example. I knew Biopython offered something of the sort, but had never tried it. @Zhaorong do you have a lot of experience with Biopython? What have you used it for? Cheers

ADD REPLYlink written 8.5 years ago by Eric Normandeau9.9k
18
gravatar for brentp
8.3 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

from the implementation in the link above, i came up with this. it's pretty concise and only consumes 1 record at a time.

from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq
ADD COMMENTlink modified 18 days ago by RamRS17k • written 8.3 years ago by brentp22k
1

I like your use of groupby and lambda

ADD REPLYlink written 8.3 years ago by Science_Robot1.1k
1

great solution. How can I make this work in Python3?

ADD REPLYlink written 6.9 years ago by User 637710
" for python3"
from itertools import groupby

def fasta_iter(fasta_name):
"""
modified from Brent Pedersen
Correct Way To Parse A Fasta File In Python
given a fasta file. yield tuples of header, sequence
"""
"first open the file outside "
fh = open(fasta_name)

# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))

for header in faiter:
    # drop the ">"
    headerStr = header.__next__()[1:].strip()

    # join all sequence lines to one.
    seq = "".join(s.strip() for s in faiter.__next__())

    yield (headerStr, seq)

fiter = fasta_iter('testset.fas')

for ff in fiter:
    headerStr, seq = ff
    print(headerStr)
ADD REPLYlink modified 18 days ago by RamRS17k • written 3.8 years ago by d.lituiev0

Nicely done, thanks for a great example on how to use groupby! I'll try to include it in my code, although I'll probably keep the Biopython fasta parser for now :)

ADD REPLYlink written 8.3 years ago by Eric Normandeau9.9k

Awesome approach, helped me a lot.

ADD REPLYlink written 22 months ago by zebudavid10

When I print "header" I get only the first header whereas I have many header. How do I get a list of all the headers or seqs?

ADD REPLYlink written 19 months ago by alowi330

That's because this is a generator function. Do get all headers you'll have to do something like:

fasta = fasta_iter("hg19.fa")
for header, seq in fasta:   
  print(header)

Or, you can use a package such as https://github.com/mdshw5/pyfaidx and do:

from pyfaidx import Fasta
fasta = Fasta("hg19.fa")
headers =fasta.keys()
ADD REPLYlink modified 18 days ago by RamRS17k • written 19 months ago by Matt Shirley8.6k
4
gravatar for brentp
8.5 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

in another thread, someone pointed you to this: http://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/

which i think is pretty beautiful. but this is trivial in any library.

ADD COMMENTlink modified 3.8 years ago by Istvan Albert ♦♦ 77k • written 8.5 years ago by brentp22k
4
gravatar for Science_Robot
8.3 years ago by
Science_Robot1.1k
Gainesville, FL
Science_Robot1.1k wrote:

Here's one I used for a script. I don't know if it's necessarily the best. I've made dozens of different variations.

You do fasta = Fasta(handle) and then for record in fasta to yield Dna objects

class Dna:
    ''' Object representing a FASTA record. '''
    def __init__(self, header, sequence):
        self.head = header
        self.seq = sequence
    def __repr__(self):
        return '[HTML]' % (self.head)
    def __str__(self, separator=''):
        return '>%s\n%s' % (self.head, separator.join(self.seq))
    def __len__(self):
        return len(''.join(self.seq))
    @property
    def sequence(self, separator=''):
        return separator.join(self.seq)

class Fasta:
    ''' A FASTA iterator/generates DNA objects. '''
    def __init__(self, handle):
        self.handle = handle
    def __repr__(self):
        return '[HTML]' % handle
    def __iter__(self):
        header, sequence = '', []
        for line in self.handle:
            if line[0] == '>':
                if sequence: yield Dna(header, sequence)
                header = line[1:-1]
                sequence = []
            else:
                sequence.append(line.strip())
        yield Dna(header, sequence)
ADD COMMENTlink modified 18 days ago by RamRS17k • written 8.3 years ago by Science_Robot1.1k
4
gravatar for Matt Shirley
4.2 years ago by
Matt Shirley8.6k
Cambridge, MA
Matt Shirley8.6k wrote:

You might check out Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. This is a package that I wrote that builds or reads a samtools compatible fasta index and can read arbitrary amounts of the fasta file at once. I've kept it up to date and am always trying to make it more lightweight and "pythonic". Even though it's designed for random access there is an efficient line-based iterator method:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...
ADD COMMENTlink modified 18 days ago by RamRS17k • written 4.2 years ago by Matt Shirley8.6k
2
gravatar for Paulo Nuin
8.5 years ago by
Paulo Nuin3.7k
Canada
Paulo Nuin3.7k wrote:

A very simple and not really elegant approach I have on Beginning Python for Bioinformatics website:

class FastaSeq:
def __init__(self, name, sequence):
    self.name = name
    self.sequence = sequence

def get_seqs(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index+=1
            name = line[:-1]
            seq = ''
            aninstance = FastaSeq(name, seq)
        else:
            seq += line[:-1]
            aninstance = FastaSeq(name, seq)
    items.append(aninstance)

    return items

It works as a learning experience. But you don't want to reinvent the wheel on this one.

ADD COMMENTlink modified 18 days ago by RamRS17k • written 8.5 years ago by Paulo Nuin3.7k
3

Any approach that builds a full list is unfeasible in the world of high throughput sequencing. In general any record based reader should be a generator that can simply be unwound into a list if one so desires.

ADD REPLYlink written 8.5 years ago by Istvan Albert ♦♦ 77k
1

Strings are immutable, so python has to create a new string object every time you append. It's better to use seq = [], seq.append(line[:-1]), and then ''.join(seq)

ADD REPLYlink modified 18 days ago by RamRS17k • written 8.3 years ago by Science_Robot1.1k

Thanks @nuin. However, this is exactly what I specified I wanted to avoid, for the reasons mentioned by @Istvan Albert.

ADD REPLYlink written 8.5 years ago by Eric Normandeau9.9k

Then just change the return statement to a yield. The code needs to be modified to handle the first sequence.

ADD REPLYlink written 8.5 years ago by Paulo Nuin3.7k

I guess you meant: change the items.append(aninstance) to yield aninstance

ADD REPLYlink modified 18 days ago by RamRS17k • written 8.5 years ago by Michael Kuhn4.9k

yes, that's right. The return statement shouldn't exist then.

ADD REPLYlink written 8.5 years ago by Paulo Nuin3.7k
0
gravatar for parashar.dhapola
3.2 years ago by
United States
parashar.dhapola150 wrote:

I regularly use this one... Its much useful when parsing large FASTA file with multiple entries.

def fasta_reader(filename):
  from Bio.SeqIO.FastaIO import FastaIterator
  with open(filename) as handle:
    for record in FastaIterator(handle):
      yield record

for entry in fasta_reader("file.fasta"):
  print strentry.id) #This is header of fasta entry
  print str(entry.seq) #This is sequence of specific fasta entry
ADD COMMENTlink modified 18 days ago by RamRS17k • written 3.2 years ago by parashar.dhapola150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 995 users visited in the last hour