Extracting Query Sequence From Blast Xml Output Using Ncbixml Parser
2
1
Entering edit mode
9.8 years ago
viv_bio ▴ 50

I want to Blast my data base of sequence (Aminoacid) with a data base. And get sequences which have a given score and e value less than 0.5 . So I ran blast with e-value 0.5 and now i want to isolate those sequences which have scores higher than given value . So i parse it using NCBIXML parser from Biopython. So far i have -

from Bio.Blast import NCBIXML 
blast = NCBIXML.parse(open(raw_input("Blast file location :"),"rU")) 
for record in blast :
            if record.alignments:
                if record.alignments[0].hsps[0].score >= 1000:
                    print record.alignments[0].hsps[0].query

This gives me all the sequence but i want it with there respective gene id in seq format so i can use it for further purpose .

biopython blast • 7.2k views
ADD COMMENT
0
Entering edit mode

Don't know what you mean by "seq format". Fasta?

ADD REPLY
2
Entering edit mode
9.8 years ago
dfornika ★ 1.0k

I'm just looking through the biopython api docs and noticed that the Alignmnet class has a hit_id method. Try adding this line:

print record.alignments[0].hit_id

... and see if those ID's are useful for you. I'm not totally clear on exactly what format you need the gene id in for your downstream use, so you may need to parse it a bit before passing it along.

ADD COMMENT
1
Entering edit mode
9.8 years ago

The XML file does not contain the raw sequences. You are going to have to get the sequence IDs from the XML file and then use the IDs to extract those sequences from the query fasta file. You can use biopython's SeqIO to parse the fasta file. So something like this:

ids = dict([(x, True) for x in open('your ids file','r')])

for record in SeqIO.parse(open('your query fasta file','r'),'fasta'):
   if ids.has_keyrecord.id):
      print ">" + record id + "\n" + str(record.seq)

Assuming you have extracted a list of sequence ids in a file separate by line.

ADD COMMENT
2
Entering edit mode

It would be more Pythonic to use 'if id in ids' rather than 'if ids.has_keyrecord.id)', and also there is no point using a dictionary like this where all the values are True. Use a Python set instead, which also has the efficient hash based membership testing offered by a dict but not a list.

ADD REPLY
0
Entering edit mode

Yeah I use has_key out of habit. 'In' just feels intuitively slower even though it's not. I didn't know sets also use hash based membership. Thanks.

ADD REPLY
0
Entering edit mode

Thanks this post helped

ADD REPLY

Login before adding your answer.

Traffic: 963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6