Hi all,
I clustered my ASVs into OTUs at 97% similarity using QIIME2's cluster-features-de-novo:
qiime vsearch cluster-features-de-novo \
--i-table asv_table.qza \
--i-sequences asv_sequences.qza \
--p-perc-identity 0.97 \
--o-clustered-table otu_table_97.qza \
--o-clustered-sequences otu_sequences_97.qza
For taxonomy assignment of these OTUs, I used standalone VSEARCH (not QIIME2) with the following parameters:
vsearch --usearch_global dna-sequences.fasta \
--db SILVA_138.1_SSURef_NR99_tax_silva_fixed.fasta \
--id 0.90 \
--maxaccepts 20 \
--maxrejects 0 \
--output_no_hits \
--blast6out amf_taxonomy_vsearch_max20.csv \
--top_hits_only
However, even with --top_hits_only, I’m getting multiple hits per OTU ID in the blast6out result. I suspect this happens when multiple reference sequences match equally well (i.e., identical percent identity and bit score).
Questions:
1. Is it expected that --top_hits_only still yields multiple hits per query when --maxaccepts is set to 20? I thought it would limit the output to one hit.
2. What is the best practice for retaining only one best taxonomy hit per OTU in this case? Should I select one manually?
3. Any suggestions for handling the long SILVA taxonomy strings with many levels? How can I clean or truncate them to a 7-level (kingdom to species) hierarchy in R?
4. Would doing the taxonomy assignment with QIIME2’s classify-consensus-vsearch help avoid this issue, or does it behave similarly?
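To make the "manually select" option concrete, here is a minimal sketch of what I have in mind: keep only the first row reported for each query ID. The file names and the toy rows are made up for illustration (they are not my real results), and I am assuming the blast6out file is tab-separated with the query ID in column 1. Note that this picks arbitrarily among tied hits, so it may hide real taxonomic ambiguity.

```shell
# Toy blast6out-style table: tab-separated, 12 columns.
# All IDs and values below are hypothetical, for illustration only.
printf 'OTU1\trefA\t98.5\t250\t4\t0\t1\t250\t1\t250\t-1\t0\n'  > toy_hits.tsv
printf 'OTU1\trefB\t98.5\t250\t4\t0\t1\t250\t1\t250\t-1\t0\n' >> toy_hits.tsv
printf 'OTU2\trefC\t95.0\t248\t12\t0\t1\t248\t1\t248\t-1\t0\n' >> toy_hits.tsv

# Keep the first row seen for each query ID (column 1) and drop later ties.
awk -F'\t' '!seen[$1]++' toy_hits.tsv > one_hit_per_otu.tsv

cat one_hit_per_otu.tsv
```

On my real data I would run the same awk line on amf_taxonomy_vsearch_max20.csv instead of the toy file, but I am not sure whether an arbitrary pick like this is defensible compared with a consensus over the tied hits.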
Any insight from your experience or pipeline practices would be highly appreciated.