Question: Split Large FASTA Into Multiple Files, Can't Name Them With GI Number
8.5 years ago, charles.bridges wrote:

I should start out by saying that I'm as new as it gets to both Python and Biopython. I'm trying to split a large .fasta file (with multiple entries) into single files, each with a single entry. I found most of the following code on the Biopython wiki/Cookbook site and adapted it just a bit. My problem is that this script names the files "1.fasta", "2.fasta", etc., and I need them named by an identifier such as the GI number.

def batch_iterator(iterator, batch_size) :
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = next(iterator)
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
record_iter = SeqIO.parse(open(infile), "fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1)) :
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1)
    handle = open(outfile, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()
    print ("Wrote %i records to %s" % (count, outfile))

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1)

with:

outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id

so that each file is named after the record ID (like seq_record.id in the SeqIO examples), it gives the following error:

Traceback (most recent call last):
  File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % record_iter.id
AttributeError: 'generator' object has no attribute 'id'

Although the generator function has no attribute 'id', can I get around this somehow? Is this script too complicated for what I'm trying to do?!? Thanks, Charles

Tags: fasta, biopython

If you want just one sequence per file, the batch function is overkill.

written 8.5 years ago by Peter
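The AttributeError in the question comes from asking the iterator for an id, when only the records it yields have one. The sketch below illustrates this without Biopython: Record and fake_parse are hypothetical stand-ins for SeqRecord and SeqIO.parse, and the IDs are made-up examples. With a batch size of 1, each batch is a one-element list, so the record ID would be batch[0].id.

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord, so the sketch runs without Biopython.
Record = namedtuple("Record", ["id", "seq"])


def fake_parse():
    """Yield records one at a time, roughly as SeqIO.parse does (stand-in only)."""
    yield Record("gi|111|ref|A.1|", "ACGT")
    yield Record("gi|222|ref|B.1|", "GGCC")


def batches_of_one(iterator):
    """Mimic batch_iterator(iterator, 1): every batch is a one-element list."""
    for record in iterator:
        yield [record]


record_iter = fake_parse()
# The generator itself has no 'id' attribute; only the records it yields do.
has_id = hasattr(record_iter, "id")

# Looping over the iterator directly gives one record at a time.
names = [record.id for record in record_iter]

# If the batching is kept, the single record of each batch is batch[0].
batch_names = [batch[0].id for batch in batches_of_one(fake_parse())]
```

Looping over the records directly, as the answers in this thread do, avoids the batching step altogether.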
8.5 years ago, Niek De Klein (Netherlands) wrote:

I'm not exactly sure what you're trying to do with batch_iterator, but if you just want to make a separate file for each fasta entry you could do:

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
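    # Name each output file after the record ID
    outfile = open("c:\\python32\\myfiles\\" + record.id + ".fasta", "w")
    outfile.write(">" + record.description + "\n")
    # record.seq is a Seq object, so convert it to a string before writing
    outfile.write(str(record.seq) + "\n")
    outfile.close()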

Can I ask why you need a separate file for each fasta sequence?


You are missing a ">" for the FASTA output. How about this for shortness:

from Bio import SeqIO
infile = input('Which .fasta file would you like to open? ')
for record in SeqIO.parse(infile, "fasta"):
    SeqIO.write(record, record.id + ".fasta", "fasta")

This assumes the record ID doesn't contain any nasty characters which would be invalid in a filename.

modified 8.5 years ago • written 8.5 years ago by Peter

But of course the record ID contains invalid characters!! How could I go about replacing the | symbols with a filename-friendly character, such as - (dash) ?

written 8.5 years ago by charles.bridges

You can do record.id.replace(";","_"), changing the ";" to whichever illegal character you need to replace (for example "|").

written 8.5 years ago by Niek De Klein
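As a minimal sketch of the replacement suggested above, the two helpers below turn a record ID into a filename-friendly string. Both function names are hypothetical, and the sketch assumes NCBI-style IDs of the form gi|number|db|accession| (the example ID is made up). gi_number pulls out just the GI number, while safe_filename replaces every character outside letters, digits, dot, underscore and dash.

```python
import re


def gi_number(record_id):
    """Return the GI number from an ID like 'gi|12345|ref|NM_000001.1|', or None."""
    parts = record_id.split("|")
    if len(parts) >= 2 and parts[0] == "gi" and parts[1].isdigit():
        return parts[1]
    return None


def safe_filename(record_id, replacement="-"):
    """Replace runs of filename-unfriendly characters and trim the ends."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, record_id)
    return cleaned.strip(replacement)


example_id = "gi|12345|ref|NM_000001.1|"  # made-up ID for illustration
gi = gi_number(example_id)
name = safe_filename(example_id) + ".fasta"
```

Either result could then be used in place of record.id when building the output filename.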
8.5 years ago, Manu Prestat (Lyon, France) wrote:

Install GenomeTools and use the command line:

gt splitfasta -splitdesc multifastafile.fa
8.5 years ago, Leonor Palmeira (Liège, Belgium) wrote:

From the Python Bioinformatics blog, you can make yourself a nice and complete script that you can then call simply by:

split_fasta.py bigfile.fasta /path/to/directory/where/to/put_the_split_files

Just put this file in a directory on your $PATH, and make it executable with:

chmod +x split_fasta.py

Here is the script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import sys
from optparse import OptionParser
from Bio import SeqIO

usage = "usage: %prog fasta_file_in directory_out"
parser = OptionParser(usage)
(opts, args) = parser.parse_args()
dir_out = os.getcwd()  # default: write into the current directory

if len(args) < 1:
    print("Error: Please enter at least one argument.")
    print("See program_name.py --help")
    sys.exit(1)
elif len(args) == 2:
    dir_out = args[1]
elif len(args) > 2:
    print("Error: Please enter up to 2 arguments.")
    print("See program_name.py --help")
    sys.exit(1)

file_in = args[0]

# Write each record to <dir_out>/<record.id>.fasta
for record in SeqIO.parse(file_in, "fasta"):
    f_out = os.path.join(dir_out, record.id + '.fasta')
    SeqIO.write([record], f_out, "fasta")
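For comparison, the same splitting can be sketched with the standard library only, in case Biopython is not available. This is not the script above but a simplified stand-in: split_fasta is a hypothetical name, the ID is taken as the first word after ">", and the sketch assumes "|" is the only character that needs replacing in filenames. The demo data is made up.

```python
import os
import tempfile


def split_fasta(path_in, dir_out):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    written = []
    out = None
    with open(path_in) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # The record ID is the first whitespace-delimited word after '>'.
                fields = line[1:].split()
                record_id = fields[0] if fields else "unnamed"
                # Assumption: '|' is the only character that needs replacing.
                name = record_id.replace("|", "-").strip("-") + ".fasta"
                path_out = os.path.join(dir_out, name)
                out = open(path_out, "w")
                written.append(path_out)
            if out is not None:
                # Copy header and sequence lines unchanged.
                out.write(line)
    if out is not None:
        out.close()
    return written


# Small demonstration on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.fa")
    with open(src, "w") as fh:
        fh.write(">gi|111|ref|A.1| first\nACGT\nTTGA\n>gi|222|ref|B.1| second\nGGCC\n")
    paths = split_fasta(src, tmp)
    names = [os.path.basename(p) for p in paths]
    with open(paths[0]) as fh:
        first_content = fh.read()
```

Unlike SeqIO.write, this copies the sequence lines as they are and does no validation, so it is only suitable for well-formed input.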

Thank you, Leonor

written 8.4 years ago by charles.bridges