Question: How To Differentiate Databases When Blasting From The Command Line?
gravatar for Iván
9.5 years ago by
Iván20 wrote:


I'm creating a python script to take each ORF from a genome and blasting it against 3 other genomes in order to retrieve common and unique ORFs for the query genome.

I am using Bio.NCBIXML to parse the blast results and am runnning the blast command using os.system as following: blastall -p blastn -e 1e-05 -i input.fasta -d "db1 db2 db3" -m 7 -o output.xml

So as you can see I formated the genomes separately as databases db1 to db3. My question is how can I know in which database, ie. for which genome, is my query ORF aligning? I thought I could extract this info from the xml tag BlastOutput_db, but it just shows all the databases as one string, like this: <BlastOutput_db>db1 db2 db3</BlastOutput_db> for each alignment, so it's no good!

This question is directly related to this, but the solution provided doesn't work as I pointed out the BlastOutput tag just lists all databases passed to the blast command line.

One other solution as suggested on the other thread would be to take each fasta header from the subject genomes (.ffn file) and convert to something like >db1|orf annotation, >db2|orf annotation, which I could then extract, but maybe there's a easier way, since each genome has around 5,000 ORFs so I would have to change 15,000 fasta headers.

xml parsing blast python • 2.1k views
ADD COMMENTlink written 9.5 years ago by Iván20
gravatar for Yannick Wurm
9.5 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

Your suggestions seems easy. To change the fasta headers, do eg:

sed 's/^>/>db1|/' < inputdb1.fasta > outputdb1.fasta


ADD COMMENTlink modified 17 months ago by Ram32k • written 9.5 years ago by Yannick Wurm2.3k

Hmm, easy indeed! I hadn't thought of sed'ing everything in one go. If no one else has a way of directly extracting the db id hit from the blast output I will mark your answer as the accepted. Thank you :)

ADD REPLYlink written 9.5 years ago by Iván20
gravatar for ALchEmiXt
9.5 years ago by
The Netherlands
ALchEmiXt1.9k wrote:

Have you reconsidered the improved blast+ suite from NCBI? It allows you to blast directly against a sequence (set) instead of a DB. No hassle of formatting and splitting afterwards. By default it is multi-threaded and therefore usually faster as well in simple setups.

my 2ct

ADD COMMENTlink written 9.5 years ago by ALchEmiXt1.9k

Problem is I'm using BioPython to parse my .xml output, so I'm not sure I'll be able to extract this info from the new blast+ output! In all cases I'll read more into it, thanks!

ADD REPLYlink written 9.5 years ago by Iván20

The output formats are very similar (if not identical). So Python, perl, should all work the same.

ADD REPLYlink written 9.5 years ago by ALchEmiXt1.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2334 users visited in the last hour