Question

Multiple taxonomy hits per OTU ID in VSEARCH despite --top_hits_only (90% ID after QIIME2 clustering at 97%)

3 months ago

salma.sarker • 0

Hi all,

I clustered my ASVs into OTUs at 97% similarity using QIIME2's cluster-features-de-novo:

qiime vsearch cluster-features-de-novo \
  --i-table asv_table.qza \
  --i-sequences asv_sequences.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table otu_table_97.qza \
  --o-clustered-sequences otu_sequences_97.qza

For taxonomy assignment of these OTUs, I used standalone VSEARCH (not QIIME2) with the following parameters:

vsearch --usearch_global dna-sequences.fasta \
  --db SILVA_138.1_SSURef_NR99_tax_silva_fixed.fasta \
  --id 0.90 \
  --maxaccepts 20 \
  --maxrejects 0 \
  --output_no_hits \
  --blast6out amf_taxonomy_vsearch_max20.csv \
  --top_hits_only

However, even with --top_hits_only, I’m getting multiple hits per OTU ID in the blast6out result. I suspect this happens when multiple reference sequences match equally well (i.e., identical percent identity and bit score).
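One way to check this suspicion is to count, for each OTU, how many hits were reported and how many distinct percent-identity values they carry. The following is only a minimal sketch: the function name `tied_queries` and the toy file are hypothetical, and it assumes a tab-separated blast6out table with the query in column 1, the target in column 2 (`*` for no hit), and percent identity in column 3.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print: query <TAB> number of hits <TAB> number of distinct identity values,
# for every query that has more than one hit. A "1" in the last column means
# all hits for that query share the same percent identity.
tied_queries() {
  awk -F '\t' '
    $2 != "*" {
      hits[$1]++
      if (!(($1, $3) in seen)) { seen[$1, $3] = 1; ids[$1]++ }
    }
    END { for (q in hits) if (hits[q] > 1) print q "\t" hits[q] "\t" ids[q] }
  ' "$1" | sort
}

# Toy input with only the first four blast6out columns (hypothetical values).
printf 'OTU1\trefA\t98.5\t250\nOTU1\trefB\t98.5\t250\nOTU2\trefC\t91.0\t240\nOTU3\t*\t0.0\t0\n' > demo_ties.tsv

tied_queries demo_ties.tsv
```

On the real output, the same function could be pointed at amf_taxonomy_vsearch_max20.csv instead of the toy file.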

Questions:

Is it expected that --top_hits_only still yields multiple hits when --maxaccepts is set to 20? I expected it to limit the output to one hit per OTU.

What is the best practice for retaining only one best taxonomy hit per OTU in this case? Should I select one manually?
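For illustration, a purely mechanical way to reduce the table to one row per OTU could look like the sketch below. It is not a recommendation on which hit is biologically correct: it ranks hits by percent identity, then by alignment length, and any remaining tie is broken arbitrarily by sort order. The function name `best_hit_per_query` and the toy file are hypothetical, and the column positions assume tab-separated blast6out output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep one row per query (column 1) from a blast6out-style table.
# Rows are ordered by query, then percent identity (column 3, descending),
# then alignment length (column 4, descending); awk keeps the first row
# it sees for each query.
best_hit_per_query() {
  sort -t "$(printf '\t')" -k1,1 -k3,3nr -k4,4nr "$1" | awk -F '\t' '!seen[$1]++'
}

# Toy input with only the first four blast6out columns (hypothetical values).
printf 'OTU1\trefA\t98.5\t250\nOTU1\trefB\t98.5\t253\nOTU2\trefC\t91.0\t240\nOTU3\t*\t0.0\t0\n' > demo_hits.tsv

best_hit_per_query demo_hits.tsv > demo_best_hits.tsv
cat demo_best_hits.tsv
```

Because tied references may carry different taxonomy strings, an arbitrary pick like this can hide real ambiguity; a consensus over the tied hits would be the alternative design.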

Any suggestions for handling the long SILVA taxonomy strings with many levels? How can I clean or truncate them to a 7-level (kingdom to species) hierarchy in R?
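To make the truncation question concrete, here is a rough shell sketch (not R) of one naive approach: keep the first level of a semicolon-separated lineage plus its last six levels. This is a heuristic only. It does not guarantee that the retained levels correspond to kingdom through species, especially for long eukaryotic lineages, and the function name `to_seven_levels` and the example lineages are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reduce a semicolon-separated lineage to at most 7 levels:
# the first level plus the last six. Shorter lineages pass through unchanged.
# Heuristic only: the kept levels are not guaranteed to be formal ranks.
to_seven_levels() {
  awk '{
    sub(/;$/, "")                      # drop a trailing semicolon, if any
    n = split($0, lvl, ";")
    if (n <= 7) { print; next }
    out = lvl[1]
    for (i = n - 5; i <= n; i++) out = out ";" lvl[i]
    print out
  }'
}

# Illustrative lineages in the general style of SILVA strings (made up here).
printf '%s\n' \
  'Eukaryota;Amorphea;Obazoa;Opisthokonta;Nucletmycea;Fungi;Mucoromycota;Glomeromycotina;Glomeromycetes;Glomerales;Glomeraceae;Rhizophagus;Rhizophagus irregularis' \
  'Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia-Shigella' \
  | to_seven_levels
```

A rank-aware mapping would need a lookup table that says which rank each SILVA level belongs to; that table is not part of this sketch.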

Would doing taxonomy assignment using QIIME2’s classify-consensus-vsearch help avoid this issue, or does it behave similarly?

Any insight from your experience or pipeline practices would be highly appreciated.

vsearch • 1.1k views