Use taxdb tp get scientific names in local BLAST output
1
0
Entering edit mode
17 days ago
Luca Arbore ▴ 10

Hello,

I have searched this issue around several websites and despite some similar posts I cannot make this work...any help would be greatly appreciated.

For the identification of insect barcodes I have made a local database in which to identify each sequence.

  1. First, I have downloaded the gene sequences for Diptera insect order as .fasta
  2. Then I have made a db using makeblastdb -in diptera_coi.fasta -dbtype 'nucl' -out db/diptera_coi
    1. The taxdb.bti/btd files were downloaded in the /bin where the barcode queries are stored.
    2. I have exported the path with export BLASTDB=/home/alex/ncbi-blast-2.16.0+/bin
    3. executed blastn -query trap_barcodes.fasta -task blastn -db db/diptera_coi -out trap_comparison_blastn.csv -outfmt "6 qseqid qlen sseqid slen sscinames pident evalue bitscore qcovs qcovhsp qstart qend sstart send" -evalue 1e-20 -max_target_seqs 10 -num_threads 20

All I miss from the output is the sscinames which keeps turning out as NA. I would be forever grateful for a clear explanation of what I am missing and doing wrong.

taxdb sscinames taxonomy makeblastdb blast • 665 views
ADD COMMENT
1
Entering edit mode

Can you show the headers of gene sequences you downloaded? It may be best to subset sequences you need from the pre-formatted blast+ database like nt/nr since the headers will be appropriate for use with the taxdb files. You can use blastdbcmd to extract sequences for diptera (taxID: 7147).

Here are general directions on how to build a local blast database with taxID support: https://www.ncbi.nlm.nih.gov/books/NBK569841/

ADD REPLY
0
Entering edit mode

most likely an issue with your fasta headers and/or with the link to your taxdb.

Can you post a small extract of both files?

ADD REPLY
0
Entering edit mode

Thank you for replies. Fast headers of downloaded sequences look like this

grep "^>" diptera_coi_030625.fasta | head -n 5 
>AB085930.1 Asphondylia gennadii mitochondrial COI gene for cytochrome oxidase subunit 1, partial cds, isolate: ChiliPad85
>AB085931.1 Asphondylia gennadii mitochondrial COI gene for cytochrome oxidase subunit 1, partial cds, isolate: ChiliPad86
>AB085932.1 Asphondylia gennadii mitochondrial COI gene for cytochrome oxidase subunit 1, partial cds, isolate: ChiliPad87
>AB085933.1 Asphondylia gennadii mitochondrial COI gene for cytochrome oxidase subunit 1, partial cds, isolate: ChiliPad88
>AB085934.1 Asphondylia gennadii mitochondrial COI gene for cytochrome oxidase subunit 1, partial cds, isolate: ChiliPad89

The instructions at https://www.ncbi.nlm.nih.gov/books/NBK569841/ , namely "If all of the sequences in your database have the same taxid, you can simply use the -taxid flag on makeblastdb to associate all sequences with that taxid rather than needing to prepare a file." are a confusing to me since all my sequences are classified under Diptera, taxid 7147.

I have tried in many way, and so far I suspect that the lack of taxid in the retrieved fasta is what's preventing me. Nonetheless, I am in over my head with this one.

ADD REPLY
0
Entering edit mode
15 days ago
GenoMax 152k

With blast+ v.2.16.

Get the sequences for diptera from "nt" blast db.

$ blastdbcmd -db nt -taxids 7147 -outfmt %f > test.fa

While the default taxID database files should work let us not take any chances. We can extract the accession numbers for the diptera sequences along with their taxID's in a messy way (this is not the only way but we will go with this)

$ blastdbcmd -db nt -taxids 7147 -outfmt %asep%T > intermed_file

Then we can convert this file to create a taxid_map file that is tab delimited

$ sed s/"sep"/"\t"/g intermed_file > test_map.txt

Now ready to create the local blast db for diptera with taxonomy info

$ makeblastdb -in test.fa -dbtype nucl -out dipt_db -parse_seqids -taxid_map test_map.txt

Try out a test search

$ more query.fa 
>test
TCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCACTAGCGTATATTAAAATTGTTGCGGTTAAAACGTTCGAAGTTTATT
CTTGTCCAACACGGGTGCTACTCCTTTATGATGGCAGTAGGTCACTGGATTGTTGCGACTATAAGACTGGGTGCGCCCGT
CGGCCTCGCGGTCGGCGCGGTCGTAGTGTGGCGCTGATGCCTTTCATCGGGTGCAGTGTTTCCGCAAGCCCAGCTGCTAT
TACCTTGAACAAATTAGAGTGCTCTAAGCAGGCTATCCTACGGCCGAGAATAACTTGCATGGAATAATGGAATATGACCT
CGGTCTTAATATTCATTGGTTTGTAATCAGATCAAGAGGTAATGATTAACAGAAGTAGTTGGGGGCATTAGTATTACGGC
GCGAGAGGTGAAATTCGTAGACCGTCGTAAGACTAACTAAAGCGAAACGATTTGCCATGGATGCTTTCATTAATCAAGAA
CGAAAGTTAGAGGATCGAGGCGATTAGATACCGCCCTAGTTCTAACCGTAAACTATGCCAATTAGCAATTGGGAGACGCT

Actual search

$ blastn -query query.fa -task blastn -db dipt_db -out blastn2.csv -outfmt "6 qseqid qlen sseqid slen scomname pident evalue bitscore qcovs qcovhsp qstart qend sstart send"

The resulting file (only a portion shown)

$ more blastn2.csv 
test    560     emb|X57172.1|   1950    Asian tiger mosquito    100.000 0.0     1011    100     100     1       560     561     1120
test    560     gb|U65375.1|    1735    yellow fever mosquito   98.404  0.0     959     100     100     1       560     563     1125
test    560     gb|L78065.1|    8312    Anopheles albimanus     83.080  5.52e-164       576     100     100     1       560     2109    2679
test    560     gb|U07981.1|    2385    Eucorethra underwoodi   81.720  2.68e-155       547     99      99      1       554     526     1052
test    560     gb|AF033949.1|  612     Bactrocera xanthodes    85.893  1.14e-96        352     61      56      239     553     218     528
test    560     gb|AF033949.1|  612     Bactrocera xanthodes    96.552  6.35e-05        49.1    61      5       42      70      4       32
test    560     gb|AF033948.1|  563     Bactrocera umbrosa      85.893  1.14e-96        352     56      56      239     553     198     508
test    560     gb|AF033945.1|  614     Bactrocera xanthodes    85.893  1.14e-96        352     62      56      239     553     219     529
test    560     gb|AF033945.1|  614     Bactrocera xanthodes    96.667  1.82e-05        50.9    62      5       42      71      5       34
test    560     gb|AF033943.1|  622     Bactrocera xanthodes    85.893  1.14e-96        352     61      56      239     553     228     538
test    560     gb|AF033943.1|  622     Bactrocera xanthodes    91.667  0.40    35.6    61      4       48      71      20      43
test    560     gb|AF033941.1|  611     oriental fruit fly      85.893  1.14e-96        352     62      56      239     553     218     528
ADD COMMENT
0
Entering edit mode

Thank you very much Genomax. Now, because there may be a misunderstanding I need to clarify the following: I do not have the nt database downloaded on my computer. I am starting by downloading the Diptera COI gene from https://www.ncbi.nlm.nih.gov/nuccore with

Diptera[Organism] AND (COI[Gene] OR CO1[Gene] OR "cytochrome oxidase subunit I"[Gene])

Is that the reason for which I do not get the taxids? Instead, first command returns

BLAST Database error: No alias or index file found for nucleotide database [nt] in search path [/home/alex/ncbi-blast-2.16.0+/bin::] 

As you can se above in the first reply, there is no taxid info in .fasta. Do I need to get the taxids in a different way in this case?

thanks again

ADD REPLY
1
Entering edit mode

Do I need to get the taxids in a different way in this case?

Yes. You can get the accession to taxID mapping file from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

Then use following to make the taxmap file (hopefully all of your accession will be covered by this file)

$ zmore nucl_gb.accession2taxid.gz | cut -f2,3 | sed 1d > tax_map.txt

Use -taxid_map tax_map.txt file when you make the local blast database.

Note: You will need taxdb (https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz ) file from blast+ preformatted databases for this to work.

ADD REPLY
0
Entering edit mode

Thank you very much, it worked.

ADD REPLY
1
Entering edit mode

Please consider accepting the answer (green check mark) to provide closure to this thread.

ADD REPLY

Login before adding your answer.

Traffic: 3603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6