Question: trying to select some files from a multifasta depending on accession number using python
0
gravatar for sumafa
3.8 years ago by
sumafa0
European Union
sumafa0 wrote:

I'm am new trying to prgramming and I woul like to separate from a multifasta, only specific sequences depending on their accession number, and create two different files, the first one with the sequences which their id is in the accession number list and the second one with the rest of files. I dont have any error after run the code but the output 1 is empty and it has to have the three sequences corresponding the accession number list....

Any comment is really welcome, and thanks in advance!

here is my not working python code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

accession_numbers = ["BA000007", "AL513382", "BX936398"]

infile = open("sequences.fasta")
my_file_1 = open("output_1.fasta", "w")

for seq_record in SeqIO.parse(infile, "fasta"):
    
        my_file_1.write(">" + str(seq_record.id) + "\n" + str(seq_record.seq)  +"\n")
    
sequence • 1.4k views
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by sumafa0
0
gravatar for geek_y
3.8 years ago by
geek_y9.6k
Barcelona/CRG/London/Imperial
geek_y9.6k wrote:

Keeping the way you are doing apart,

if seq_record.id in accession_numbers:
    my_file_1.write(">" + strseq_record.id) + "\n" + str(seq_record.seq)  +"\n")
else:
    my_file_2.write(">" + strseq_record.id) + "\n" + str(seq_record.seq)  +"\n")

should be str(seq_record.id).

If you are looking for partial string matching, I would create a small function to do that. If you have a small function, you will have all the power how you would like to match the strings. Something like:

def check_id(seq_id):
    for acc in accession_numbers:
        if acc in seq_id:
            return True​

for seq_record in SeqIO.parse(infile, "fasta"):
    if check_id(seq_record.id):
        SeqIO.write(seq_record, my_file_1, "fasta")
    else:
        SeqIO.write(seq_record, my_file_2, "fasta")  ​
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by geek_y9.6k

He is not getting any errors so I think that's probably just a typo in the post.

ADD REPLYlink written 3.8 years ago by Damian Kao15k
1

But otherwise it works. May be the problem is with exact string match. I updated my post.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by geek_y9.6k

I actually wrote str(seq_record.id) but dont know why the screen doesnt show it.

thanks

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by sumafa0
0
gravatar for Damian Kao
3.8 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Can you paste a few lines of your fasta file? The record.id string is the first element after splitting the header by white space. So if your fasta header is NCBI formatted header, the record.id will look something like "gi|568336023|gb|CM000663.2". And that will obviously not match your accession numbers. 

You'll need to parse out the accession numbers from the record.id string and then do the check. So in the case that the accession is the fourth element after splitting the header by the '|' character:

if seq_record.id.split('|')[3] in acession_numbers:

 

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Damian Kao15k

You are right the  record.id string is in NCBI fasta format and split method works but the code write 5 times the files of the if statement and in the second one the code write 5 times the sequences which their accession numbers are not in the list, do you have idea why its working in this way? Here is the code which I am talking about

thanks

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by sumafa0

"Here is the code which I am talking about"  Where is the code ?

ADD REPLYlink written 3.8 years ago by geek_y9.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1843 users visited in the last hour