Filter sequences by length and rename them with biopython
2
1
Entering edit mode
9.4 years ago

I guess this is a very simple question for most of the people here. But I really want to know how to solve it. Thanks.

When I got a fasta file from genbank, I want to remove sequences <500 aa and rename each sequence with gb accession number and common name. I wrote two small files to solve the problem. But I don't know how to write a main program to put them together. The input file is rdrp all ref.fasta

1 sequence length filter

from Bio import SeqIO

def long_seq(filename):
    long_sequences = []
    input_handle=open(filename,'rU')
    output_handle = open("long_seqs.fasta", "w")

    for record in SeqIO.parse(input_handle, "fasta") :
        if len(record.seq) >= 500 :
            # Add this record to our list
            long_sequences.append(record)

    print "Found %i long sequences" % len(long_sequences)

    SeqIO.write(long_sequences, output_handle, "fasta")
    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    long_seq('rdrp all ref.fasta')

2 rename each sequence from genbank fasta file with gb number + common name

def rename_seq(filename):
    input_handle=open(filename)
    output_handle=open("rename_seq.fasta","w")

    for line in input_handle:
        if '>' in line:
            id_old=line.strip()
            a=id_old.find('.')
            b=id_old.find('[')
            c=id_old.find(']')
            id_new='>'+id_old[(a-8):a]+' '+id_old[(b+1):c]
            print>>output_handle, id_new
        else:
            seq=line.strip()
            print>>output_handle, seq    


    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    rename_seq('long_seqs.fasta')
sequence • 5.7k views
ADD COMMENT
1
Entering edit mode
9.4 years ago
Ram 43k

You could minimize the number of times you open the file by making long_seq(seq,X) simply about passing a flag if the len(seq)>X (though that would make the function meaningless).

Or, you could just put the files in the same directory and call one from the other - you'll just have to add an import with the file name of the file you're calling, which in your case would be:

import long_seq_filter.py

I'd also suggest using SeqIO in your second script so you can parse FASTA without worries.

ADD COMMENT
0
Entering edit mode
9.4 years ago
#shorter code.

# sequence length filter and
# rename the sequence with gb accession no. and common name
# input_file.fasta and output_file.fasta are given as two cmd args

from Bio import SeqIO
import sys

def long_seq(filename):
    long_sequences = []
    input_handle=open(sys.argv[1],'rU')
    output_handle = open(sys.argv[2], "w")

    for record in SeqIO.parse(input_handle, "fasta") :
        if len(record.seq) >= 500 :
            long_sequences.append(record)

    print "Found %i long sequences" % len(long_sequences)

    for item in long_sequences:
        id_old=item.description
        a=id_old.find('.')
        b=id_old.find('[')
        c=id_old.find(']')
        id_new='>'+id_old[(a-8):a]+' '+id_old[(b+1):c]
        print>>output_handle, id_new
        sequence=item.seq
        print>>output_handle, sequence


    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    long_seq(sys.argv[1])
ADD COMMENT

Login before adding your answer.

Traffic: 2646 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6