Question: Extracting Query Sequence From Blast Xml Output Using Ncbixml Parser
gravatar for viv_bio
6.8 years ago by
IIT Bombay
viv_bio30 wrote:

I want to Blast my data base of sequence (Aminoacid) with a data base. And get sequences which have a given score and e value less than 0.5 . So I ran blast with e-value 0.5 and now i want to isolate those sequences which have scores higher than given value . So i parse it using NCBIXML parser from Biopython. So far i have -

from Bio.Blast import NCBIXML 
blast = NCBIXML.parse(open(raw_input("Blast file location :"),"rU")) 
for record in blast :
            if record.alignments:
                if record.alignments[0].hsps[0].score >= 1000:
                    print record.alignments[0].hsps[0].query

This gives me all the sequence but i want it with there respective gene id in seq format so i can use it for further purpose .

biopython blast • 5.3k views
ADD COMMENTlink written 6.8 years ago by viv_bio30

Don't know what you mean by "seq format". Fasta?

ADD REPLYlink written 6.8 years ago by Neilfws48k
gravatar for dfornika
6.8 years ago by
Vancouver, British Columbia, Canada
dfornika1000 wrote:

I'm just looking through the biopython api docs and noticed that the Alignmnet class has a hit_id method. Try adding this line:

print record.alignments[0].hit_id

... and see if those ID's are useful for you. I'm not totally clear on exactly what format you need the gene id in for your downstream use, so you may need to parse it a bit before passing it along.

ADD COMMENTlink written 6.8 years ago by dfornika1000
gravatar for Damian Kao
6.8 years ago by
Damian Kao15k
Damian Kao15k wrote:

The XML file does not contain the raw sequences. You are going to have to get the sequence IDs from the XML file and then use the IDs to extract those sequences from the query fasta file. You can use biopython's SeqIO to parse the fasta file. So something like this:

ids = dict([(x, True) for x in open('your ids file','r')])

for record in SeqIO.parse(open('your query fasta file','r'),'fasta'):
      print ">" + record id + "\n" + str(record.seq)

Assuming you have extracted a list of sequence ids in a file separate by line.

ADD COMMENTlink written 6.8 years ago by Damian Kao15k

It would be more Pythonic to use 'if id in ids' rather than 'if', and also there is no point using a dictionary like this where all the values are True. Use a Python set instead, which also has the efficient hash based membership testing offered by a dict but not a list.

ADD REPLYlink written 6.8 years ago by Peter5.8k

Yeah I use has_key out of habit. 'In' just feels intuitively slower even though it's not. I didn't know sets also use hash based membership. Thanks.

ADD REPLYlink written 6.8 years ago by Damian Kao15k

Thanks this post helped

ADD REPLYlink written 6.8 years ago by viv_bio30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1798 users visited in the last hour