This one is (I guess) a tricky question...
Discovering RNA viruses in metagenome/metatranscriptome datasets (in general, from environmental samples) is particularly difficult because their genome sequences are VERY DIVERGENT and show little similarity to what is available in reference sequence databases.
Can you recommend a "typical" protocol for this?
I have found two "versions" so far:
#FIRST PROTOCOL#
- Assemble reads with Trinity or metaSPAdes.
- Do tBLASTn with the generated contigs/scaffolds against a database made of RNA virus proteins (ssRNA and dsRNA viruses). Use an e-value cutoff of <= 10^-3.
- All candidate contigs selected by the previous step are queried against the NCBI RefSeq database using BLASTx.
- Only contigs whose top hits are to viruses are kept.
- Contigs are binned into distinct viral groups according to their best BLAST hits.
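To make the filtering and binning steps of the first protocol concrete, here is a minimal sketch in Python. It assumes the BLAST search was written in the standard 12-column tabular format (`-outfmt 6`), where column 11 is the e-value and column 12 the bit score. The function names, the accession-like IDs in the demo data, and the subject-to-group mapping are all hypothetical and only there for illustration. Picking the "best" hit by bit score is one reasonable choice, not necessarily what the original protocol does.

```python
from collections import defaultdict


def best_hits(blast_lines, max_evalue=1e-3):
    """Return {query_id: (subject_id, evalue, bitscore)} with the top hit per query.

    Expects BLAST tabular lines (-outfmt 6, 12 standard columns).
    Hits with an e-value above max_evalue are ignored.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        # Keep the hit with the highest bit score for each query contig.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def bin_by_group(best, subject_to_group):
    """Group contigs by the viral group of their best hit.

    subject_to_group is a hypothetical lookup table that you would have to
    build yourself (e.g. from the taxonomy of the database sequences).
    """
    bins = defaultdict(list)
    for query, (subject, _evalue, _bitscore) in best.items():
        bins[subject_to_group.get(subject, "unassigned")].append(query)
    return dict(bins)


# Made-up demo data: IDs and values are placeholders, not real results.
demo = [
    "contig_1\tYP_000001.1\t35.2\t210\t130\t4\t12\t640\t5\t210\t1e-50\t180",
    "contig_1\tYP_000002.1\t30.0\t150\t100\t3\t20\t470\t8\t150\t2e-10\t65.1",
    "contig_2\tYP_000003.1\t28.0\t90\t60\t2\t1\t270\t1\t90\t0.5\t30.2",
]
hits = best_hits(demo)
groups = bin_by_group(hits, {"YP_000001.1": "Picornavirales"})
```

In this toy example `contig_2` is dropped because its only hit has an e-value above the cutoff, and `contig_1` is binned by its higher-scoring hit.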
#SECOND PROTOCOL#
- Assemble reads with Trinity or metaSPAdes.
- Do BLASTx with the generated contigs/scaffolds against a database made of RNA virus proteins (ssRNA and dsRNA viruses). Use an e-value cutoff of <= 10^-5.
- All candidate contigs are translated into proteins with Prodigal.
- The proteins are queried against CDD (e-value cutoff of 0.01) to look for conserved domains.
- Keep the contigs containing domains of RNA-dependent RNA polymerases or reverse transcriptases.
- Contigs containing those domains are queried against the NCBI nr database using BLASTx to discard "false positives". Only contigs with hits to viruses are kept.
Thanks very much in advance!
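The domain-filtering step of the second protocol could look roughly like the sketch below. It assumes the conserved-domain hits have already been reduced to simple (protein ID, domain description) pairs, which is a simplified, hypothetical input and not the native output of any CDD tool. It also relies on Prodigal naming each predicted protein as the contig ID plus an underscore and a gene index. The keyword list and function names are my own assumptions and would need tuning against the real domain descriptions.

```python
# Hypothetical keywords; matching is a case-insensitive substring search.
KEYWORDS = ("rdrp", "rna-dependent rna polymerase", "reverse transcriptase")


def contig_of(protein_id):
    """Recover the contig ID from a Prodigal protein ID (<contig>_<gene index>)."""
    return protein_id.rsplit("_", 1)[0]


def contigs_with_polymerase(domain_rows, keywords=KEYWORDS):
    """Return the set of contigs with at least one RdRp or RT domain hit.

    domain_rows is an iterable of (protein_id, domain_description) pairs.
    """
    keep = set()
    for protein_id, description in domain_rows:
        text = description.lower()
        if any(keyword in text for keyword in keywords):
            keep.add(contig_of(protein_id))
    return keep


# Made-up demo rows: IDs and descriptions are placeholders.
rows = [
    ("contig_1_2", "RdRP_1, RNA dependent RNA polymerase"),
    ("contig_3_1", "ABC transporter ATP-binding protein"),
    ("NODE_7_length_900_1", "Reverse transcriptase (RT) domain"),
]
selected = contigs_with_polymerase(rows)
```

The selected contigs would then go into the final BLASTx search against nr. Keyword matching like this is crude, so filtering on a curated list of domain accessions would probably be more robust.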