Question

trying to select some files from a multifasta depending on accession number using python

0

Entering edit mode

10.0 years ago

sumafa • 0

I'm new trying to prgramming and I would like to separate from a multifasta, only specific sequences depending on their accession number, and create two different files, the first one with the sequences which their id is in the accession number list and the second one with the rest of files. I don't have any error after run the code but the output 1 is empty and it has to have the three sequences corresponding the accession number list.

Any comment is really welcome, and thanks in advance!

Here is my not working python code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

accession_numbers = ["BA000007", "AL513382", "BX936398"]

infile = open("sequences.fasta")
my_file_1 = open("output_1.fasta", "w")

for seq_record in SeqIO.parse(infile, "fasta"):
        my_file_1.write(">" + str(seq_record.id) + "\n" + str(seq_record.seq)  +"\n")

sequence • 3.1k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.0 years ago by sumafa • 0

Ram · Answer 1 · 2015-08-02

0

Entering edit mode

10.0 years ago

GouthamAtla 12k

Keeping the way you are doing apart,

if seq_record.id in accession_numbers:
    my_file_1.write(">" + strseq_record.id) + "\n" + str(seq_record.seq)  +"\n")
else:
    my_file_2.write(">" + strseq_record.id) + "\n" + str(seq_record.seq)  +"\n")

should be strseq_record.id).

If you are looking for partial string matching, I would create a small function to do that. If you have a small function, you will have all the power how you would like to match the strings. Something like:

def check_id(seq_id):
    for acc in accession_numbers:
        if acc in seq_id:
            return True

for seq_record in SeqIO.parse(infile, "fasta"):
    if check_idseq_record.id):
        SeqIO.write(seq_record, my_file_1, "fasta")
    else:
        SeqIO.write(seq_record, my_file_2, "fasta")

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by GouthamAtla 12k

0

Entering edit mode

He is not getting any errors so I think that's probably just a typo in the post.

ADD REPLY • link 10.0 years ago by Damian Kao 16k

1

Entering edit mode

But otherwise it works. May be the problem is with exact string match. I updated my post.

ADD REPLY • link 10.0 years ago by GouthamAtla 12k

0

Entering edit mode

I actually wrote strseq_record.id) but don't know why the screen doesnt show it.

thanks

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by sumafa • 0

Ram · Answer 2 · 2015-08-02

0

Entering edit mode

10.0 years ago

Damian Kao 16k

Can you paste a few lines of your fasta file? The record.id string is the first element after splitting the header by white space. So if your fasta header is NCBI formatted header, the record.id will look something like gi|568336023|gb|CM000663.2. And that will obviously not match your accession numbers.

You'll need to parse out the accession numbers from the record.id string and then do the check. So in the case that the accession is the fourth element after splitting the header by the | character:

if seq_record.id.split('|')[3] in accession_numbers:

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.0 years ago by Damian Kao 16k

0

Entering edit mode

You are right the record.id string is in NCBI fasta format and split method works but the code write 5 times the files of the if statement and in the second one the code write 5 times the sequences which their accession numbers are not in the list, do you have idea why its working in this way? Here is the code which I am talking about

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.0 years ago by sumafa • 0

0

Entering edit mode

Here is the code which I am talking about

Where is the code?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.9 years ago by GouthamAtla 12k