Question: Filter sequences by length and rename them with biopython
1
gravatar for linlinli99013
4.9 years ago by
United States
linlinli9901310 wrote:

I guess this is a very simple question for most of the people here. But I really want to know how to solve it. Thanks.

When I got a fasta file from genbank, I want to remove sequences <500 aa and rename each sequence with gb accession number and common name. I wrote two small files to solve the problem. But I don't know how to write a main program to put them together. The input file is "rdrp all ref.fasta"

 

# 1 sequence length filter

from Bio import SeqIO
 
def long_seq(filename):
    long_sequences = []
    input_handle=open(filename,'rU')
    output_handle = open("long_seqs.fasta", "w")
    
    for record in SeqIO.parse(input_handle, "fasta") :
        if len(record.seq) >= 500 :
            # Add this record to our list
            long_sequences.append(record)
 
    print "Found %i long sequences" % len(long_sequences)

    SeqIO.write(long_sequences, output_handle, "fasta")
    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    long_seq('rdrp all ref.fasta')

 

#2 rename each sequence from genbank fasta file with gb number + common name

def rename_seq(filename):
    input_handle=open(filename)
    output_handle=open("rename_seq.fasta","w")

    for line in input_handle:
        if '>' in line:
            id_old=line.strip()
            a=id_old.find('.')
            b=id_old.find('[')
            c=id_old.find(']')
            id_new='>'+id_old[(a-8):a]+' '+id_old[(b+1):c]
            print>>output_handle, id_new
        else:
            seq=line.strip()
            print>>output_handle, seq    
        

    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    rename_seq('long_seqs.fasta')

 

sequence forum • 3.3k views
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by linlinli9901310
1
gravatar for RamRS
4.9 years ago by
RamRS24k
Houston, TX
RamRS24k wrote:

You could minimize the number of times you open the file by making long_seq(seq,X) simply about passing a flag if the len(seq)>X (though that would make the function meaningless).

Or, you could just put the files in the same directory and call one from the other - you'll just have to add an import with the file name of the file you're calling, which in your case would be:

import long_seq_filter.py

I'd also suggest using SeqIO in your second script so you can parse FASTA without worries.

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by RamRS24k
0
gravatar for linlinli99013
4.9 years ago by
United States
linlinli9901310 wrote:
#shorter code.

# sequence length filter and
# rename the sequence with gb accession no. and common name
# input_file.fasta and output_file.fasta are given as two cmd args

from Bio import SeqIO
import sys
 
def long_seq(filename):
    long_sequences = []
    input_handle=open(sys.argv[1],'rU')
    output_handle = open(sys.argv[2], "w")
    
    for record in SeqIO.parse(input_handle, "fasta") :
        if len(record.seq) >= 500 :
            long_sequences.append(record)
 
    print "Found %i long sequences" % len(long_sequences)

    for item in long_sequences:
        id_old=item.description
        a=id_old.find('.')
        b=id_old.find('[')
        c=id_old.find(']')
        id_new='>'+id_old[(a-8):a]+' '+id_old[(b+1):c]
        print>>output_handle, id_new
        sequence=item.seq
        print>>output_handle, sequence

    
    input_handle.close()
    output_handle.close()

if __name__=="__main__":
    long_seq(sys.argv[1])
    


    

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by linlinli9901310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1312 users visited in the last hour