Question

How To Differentiate Databases When Blasting From The Command Line?

2

Entering edit mode

13.8 years ago

Iván ▴ 60

Hello,

I'm creating a python script to take each ORF from a genome and blasting it against 3 other genomes in order to retrieve common and unique ORFs for the query genome.

I am using Bio.NCBIXML to parse the blast results and am runnning the blast command using os.system as following: blastall -p blastn -e 1e-05 -i input.fasta -d "db1 db2 db3" -m 7 -o output.xml

So as you can see I formated the genomes separately as databases db1 to db3. My question is how can I know in which database, ie. for which genome, is my query ORF aligning? I thought I could extract this info from the xml tag BlastOutput_db, but it just shows all the databases as one string, like this: <BlastOutput_db>db1 db2 db3</BlastOutput_db> for each alignment, so it's no good!

This question is directly related to this, but the solution provided doesn't work as I pointed out the BlastOutput tag just lists all databases passed to the blast command line.

One other solution as suggested on the other thread would be to take each fasta header from the subject genomes (.ffn file) and convert to something like >db1|orf annotation, >db2|orf annotation, which I could then extract, but maybe there's a easier way, since each genome has around 5,000 ORFs so I would have to change 15,000 fasta headers.

xml parsing blast python • 3.5k views

ADD COMMENT • link updated 13.8 years ago by ALchEmiXt ★ 1.9k • written 13.8 years ago by Iván ▴ 60

Ram · Answer 1 · 2011-09-18

5

Entering edit mode

13.8 years ago

Yannick Wurm ★ 2.5k

Your suggestions seems easy. To change the fasta headers, do eg:

sed 's/^>/>db1|/' < inputdb1.fasta > outputdb1.fasta

Cheers
y

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.8 years ago by Yannick Wurm ★ 2.5k

0

Entering edit mode

Hmm, easy indeed! I hadn't thought of sed'ing everything in one go. If no one else has a way of directly extracting the db id hit from the blast output I will mark your answer as the accepted. Thank you :)

ADD REPLY • link 13.8 years ago by Iván ▴ 60

score 1 · Answer 2 · 2011-09-18

1

Entering edit mode

13.8 years ago

ALchEmiXt ★ 1.9k

Have you reconsidered the improved blast+ suite from NCBI? It allows you to blast directly against a sequence (set) instead of a DB. No hassle of formatting and splitting afterwards. By default it is multi-threaded and therefore usually faster as well in simple setups.

my 2ct

ADD COMMENT • link 13.8 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Problem is I'm using BioPython to parse my .xml output, so I'm not sure I'll be able to extract this info from the new blast+ output! In all cases I'll read more into it, thanks!

ADD REPLY • link 13.8 years ago by Iván ▴ 60

0

Entering edit mode

The output formats are very similar (if not identical). So Python, perl, bash....it should all work the same.

ADD REPLY • link 13.8 years ago by ALchEmiXt ★ 1.9k