Hello,
I'm creating a python script to take each ORF from a genome and blasting it against 3 other genomes in order to retrieve common and unique ORFs for the query genome.
I am using Bio.NCBIXML to parse the blast results and am runnning the blast command using os.system
as following: blastall -p blastn -e 1e-05 -i input.fasta -d "db1 db2 db3" -m 7 -o output.xml
So as you can see I formated the genomes separately as databases db1
to db3
. My question is how can I know in which database, ie. for which genome, is my query ORF aligning? I thought I could extract this info from the xml tag BlastOutput_db, but it just shows all the databases as one string, like this: <BlastOutput_db>db1 db2 db3</BlastOutput_db>
for each alignment, so it's no good!
This question is directly related to this, but the solution provided doesn't work as I pointed out the BlastOutput tag just lists all databases passed to the blast command line.
One other solution as suggested on the other thread would be to take each fasta header from the subject genomes (.ffn file) and convert to something like >db1|orf annotation
, >db2|orf annotation
, which I could then extract, but maybe there's a easier way, since each genome has around 5,000 ORFs so I would have to change 15,000 fasta headers.
Hmm, easy indeed! I hadn't thought of sed'ing everything in one go. If no one else has a way of directly extracting the db id hit from the blast output I will mark your answer as the accepted. Thank you :)