With blast+ v.2.16.
Get the sequences for Diptera from the "nt" blast db.
$ blastdbcmd -db nt -taxids 7147 -outfmt %f > test.fa
While the default taxID database files should work, let us not take any chances. We can extract the accession numbers for the Diptera sequences along with their taxIDs in a slightly messy way, using a literal "sep" as a temporary separator (this is not the only way, but we will go with it).
$ blastdbcmd -db nt -taxids 7147 -outfmt %asep%T > intermed_file
Then we can convert this file into a tab-delimited taxid_map file.
$ sed s/"sep"/"\t"/g intermed_file > test_map.txt
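Before handing the map to makeblastdb it may be worth checking that the conversion worked. The sketch below assumes the map should have exactly two tab-separated columns with a numeric taxid in the second; check_taxid_map is a made-up helper name, not part of BLAST+.

```shell
# Sketch: sanity-check a two-column, tab-delimited taxid map.
# check_taxid_map is a hypothetical helper, not a BLAST+ command.
check_taxid_map() {
    awk -F '\t' '
        # Flag lines without exactly two fields, with an empty ID,
        # or with a non-numeric taxid (e.g. a leftover "sep").
        NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ {
            printf "line %d looks malformed: %s\n", NR, $0
            bad++
        }
        END {
            printf "%d lines checked, %d malformed\n", NR, bad + 0
            exit (bad > 0)   # non-zero exit status if anything was flagged
        }
    ' "$1"
}
# usage: check_taxid_map test_map.txt
```

If anything is flagged, the usual suspects are a sed that did not interpret "\t" as a tab or stray Windows line endings.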
Now we are ready to create the local blast db for Diptera with taxonomy info.
$ makeblastdb -in test.fa -dbtype nucl -out dipt_db -parse_seqids -taxid_map test_map.txt
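To confirm that the taxids actually made it into the new database, one option is to dump accession/taxid pairs with blastdbcmd and count how many have no taxid. This is only a sketch: count_missing_taxids is a made-up helper name, and it assumes that sequences without an assigned taxid are printed as 0 by %T, which is worth verifying on your own install.

```shell
# Sketch: count "accession taxid" lines whose taxid column is 0.
# count_missing_taxids is a hypothetical helper, not a BLAST+ command.
# Assumption: blastdbcmd reports sequences with no taxid as 0.
count_missing_taxids() {
    awk '$2 == 0 { n++ } END { print n + 0 }'
}
# usage (needs BLAST+ and the dipt_db built above):
#   blastdbcmd -db dipt_db -entry all -outfmt "%a %T" | count_missing_taxids
```

A result of 0 suggests every sequence picked up a taxid from test_map.txt.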
Try out a test search
$ more query.fa
>test
TCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCACTAGCGTATATTAAAATTGTTGCGGTTAAAACGTTCGAAGTTTATT
CTTGTCCAACACGGGTGCTACTCCTTTATGATGGCAGTAGGTCACTGGATTGTTGCGACTATAAGACTGGGTGCGCCCGT
CGGCCTCGCGGTCGGCGCGGTCGTAGTGTGGCGCTGATGCCTTTCATCGGGTGCAGTGTTTCCGCAAGCCCAGCTGCTAT
TACCTTGAACAAATTAGAGTGCTCTAAGCAGGCTATCCTACGGCCGAGAATAACTTGCATGGAATAATGGAATATGACCT
CGGTCTTAATATTCATTGGTTTGTAATCAGATCAAGAGGTAATGATTAACAGAAGTAGTTGGGGGCATTAGTATTACGGC
GCGAGAGGTGAAATTCGTAGACCGTCGTAAGACTAACTAAAGCGAAACGATTTGCCATGGATGCTTTCATTAATCAAGAA
CGAAAGTTAGAGGATCGAGGCGATTAGATACCGCCCTAGTTCTAACCGTAAACTATGCCAATTAGCAATTGGGAGACGCT
Actual search
$ blastn -query query.fa -task blastn -db dipt_db -out blastn2.csv -outfmt "6 qseqid qlen sseqid slen scomname pident evalue bitscore qcovs qcovhsp qstart qend sstart send"
The resulting file (only a portion shown)
$ more blastn2.csv
test 560 emb|X57172.1| 1950 Asian tiger mosquito 100.000 0.0 1011 100 100 1 560 561 1120
test 560 gb|U65375.1| 1735 yellow fever mosquito 98.404 0.0 959 100 100 1 560 563 1125
test 560 gb|L78065.1| 8312 Anopheles albimanus 83.080 5.52e-164 576 100 100 1 560 2109 2679
test 560 gb|U07981.1| 2385 Eucorethra underwoodi 81.720 2.68e-155 547 99 99 1 554 526 1052
test 560 gb|AF033949.1| 612 Bactrocera xanthodes 85.893 1.14e-96 352 61 56 239 553 218 528
test 560 gb|AF033949.1| 612 Bactrocera xanthodes 96.552 6.35e-05 49.1 61 5 42 70 4 32
test 560 gb|AF033948.1| 563 Bactrocera umbrosa 85.893 1.14e-96 352 56 56 239 553 198 508
test 560 gb|AF033945.1| 614 Bactrocera xanthodes 85.893 1.14e-96 352 62 56 239 553 219 529
test 560 gb|AF033945.1| 614 Bactrocera xanthodes 96.667 1.82e-05 50.9 62 5 42 71 5 34
test 560 gb|AF033943.1| 622 Bactrocera xanthodes 85.893 1.14e-96 352 61 56 239 553 228 538
test 560 gb|AF033943.1| 622 Bactrocera xanthodes 91.667 0.40 35.6 61 4 48 71 20 43
test 560 gb|AF033941.1| 611 oriental fruit fly 85.893 1.14e-96 352 62 56 239 553 218 528
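Some subjects appear more than once above because each HSP gets its own row. If one row per subject is enough, something like the following could be used. It is a sketch that assumes the tab-delimited column order from the blastn command above (sseqid in column 3, bitscore in column 8); best_hsp_per_subject is a made-up helper name.

```shell
# Sketch: keep only the highest-bitscore row for each subject sequence.
# best_hsp_per_subject is a hypothetical helper, not a BLAST+ command.
# Assumes tab-delimited input with sseqid in column 3 and bitscore in column 8.
best_hsp_per_subject() {
    awk -F '\t' '
        # Remember the order in which subjects first appear.
        !($3 in best) { order[++n] = $3 }
        # Store the row if the subject is new or this HSP scores higher.
        !($3 in best) || $8 + 0 > score[$3] { best[$3] = $0; score[$3] = $8 + 0 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}
# usage: best_hsp_per_subject blastn2.csv > blastn2_best.tsv
```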
Can you show the headers of the gene sequences you downloaded? It may be best to subset the sequences you need from a pre-formatted blast+ database like nt/nr, since the headers will be appropriate for use with the taxdb files. You can use blastdbcmd to extract sequences for Diptera (taxID: 7147).
Here are general directions on how to build a local blast database with taxID support: https://www.ncbi.nlm.nih.gov/books/NBK569841/
Most likely an issue with your FASTA headers and/or with the link to your taxdb.
Can you post a small extract of both files?
Thank you for the replies. FASTA headers of the downloaded sequences look like this
The instructions at https://www.ncbi.nlm.nih.gov/books/NBK569841/ , namely "If all of the sequences in your database have the same taxid, you can simply use the -taxid flag on makeblastdb to associate all sequences with that taxid rather than needing to prepare a file.", are confusing to me, since all my sequences are classified under Diptera, taxid 7147.
I have tried in many ways, and so far I suspect that the lack of a taxid in the retrieved FASTA is what's preventing me. Nonetheless, I am in over my head with this one.
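For reference, the -taxid route quoted above tags every sequence with the same single taxid, so taxonomy fields in the search output would say Diptera for every hit rather than a species. A rough sketch of both options follows. The names my_seqs.fa, my_dipt_db and my_map.txt are placeholders, single_taxid_map is a made-up helper, and the helper assumes the first whitespace-delimited token of each FASTA header is the sequence ID that makeblastdb will parse.

```shell
# Option 1 (needs BLAST+): tag every sequence with the same taxid.
# my_seqs.fa and my_dipt_db are placeholder names.
#   makeblastdb -in my_seqs.fa -dbtype nucl -out my_dipt_db -parse_seqids -taxid 7147

# Option 2: write an explicit "seqid<TAB>taxid" map, one line per FASTA record,
# which can later be edited to hold species-level taxids.
# single_taxid_map is a hypothetical helper, not a BLAST+ command.
single_taxid_map() {
    # $1 = FASTA file, $2 = taxid to assign to every record
    awk -v taxid="$2" '/^>/ { print substr($1, 2) "\t" taxid }' "$1"
}
# usage:
#   single_taxid_map my_seqs.fa 7147 > my_map.txt
#   makeblastdb -in my_seqs.fa -dbtype nucl -out my_dipt_db -parse_seqids -taxid_map my_map.txt
```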