How To Differentiate Databases When Blasting From The Command Line?
2
2
Entering edit mode
12.4 years ago
Iván ▴ 60

Hello,

I'm creating a python script to take each ORF from a genome and blasting it against 3 other genomes in order to retrieve common and unique ORFs for the query genome.

I am using Bio.NCBIXML to parse the blast results and am runnning the blast command using os.system as following: blastall -p blastn -e 1e-05 -i input.fasta -d "db1 db2 db3" -m 7 -o output.xml

So as you can see I formated the genomes separately as databases db1 to db3. My question is how can I know in which database, ie. for which genome, is my query ORF aligning? I thought I could extract this info from the xml tag BlastOutput_db, but it just shows all the databases as one string, like this: <BlastOutput_db>db1 db2 db3</BlastOutput_db> for each alignment, so it's no good!

This question is directly related to this, but the solution provided doesn't work as I pointed out the BlastOutput tag just lists all databases passed to the blast command line.

One other solution as suggested on the other thread would be to take each fasta header from the subject genomes (.ffn file) and convert to something like >db1|orf annotation, >db2|orf annotation, which I could then extract, but maybe there's a easier way, since each genome has around 5,000 ORFs so I would have to change 15,000 fasta headers.

xml parsing blast python • 2.9k views
ADD COMMENT
5
Entering edit mode
12.4 years ago
Yannick Wurm ★ 2.5k

Your suggestions seems easy. To change the fasta headers, do eg:

sed 's/^>/>db1|/' < inputdb1.fasta > outputdb1.fasta

Cheers
y

ADD COMMENT
0
Entering edit mode

Hmm, easy indeed! I hadn't thought of sed'ing everything in one go. If no one else has a way of directly extracting the db id hit from the blast output I will mark your answer as the accepted. Thank you :)

ADD REPLY
1
Entering edit mode
12.4 years ago
ALchEmiXt ★ 1.9k

Have you reconsidered the improved blast+ suite from NCBI? It allows you to blast directly against a sequence (set) instead of a DB. No hassle of formatting and splitting afterwards. By default it is multi-threaded and therefore usually faster as well in simple setups.

my 2ct

ADD COMMENT
0
Entering edit mode

Problem is I'm using BioPython to parse my .xml output, so I'm not sure I'll be able to extract this info from the new blast+ output! In all cases I'll read more into it, thanks!

ADD REPLY
0
Entering edit mode

The output formats are very similar (if not identical). So Python, perl, bash....it should all work the same.

ADD REPLY

Login before adding your answer.

Traffic: 2199 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6