Viral taxonomic classification question
5 weeks ago

Hello, I need to do a taxonomic classification of viral metagenomic sequences. I installed Kaiju:

Kaiju provides several standard databases, one of them containing viruses from the NCBI RefSeq database (0.49M sequences). The issue is that with that database I got a really low classification rate (only 2.95% of the paired-end reads were classified). As I'm dealing with viral genomic data, I think it will be necessary to let Kaiju accept more mismatches, given the mutation rates of viruses (10^-8 to 10^-6 for DNA viruses and 10^-6 to 10^-4 for RNA viruses). By default Kaiju allows 3 mismatches, so should I increase the number of allowed mismatches? To what value?
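For reference, the mismatch setting in question is Kaiju's `-e` option in Greedy mode (`-a greedy`). Kaiju matches translated reads against protein sequences, so these are amino acid mismatches, and Greedy mode also applies a minimum match score (`-s`), so raising `-e` alone may not change much. Below is a minimal Python sketch for comparing a few `-e` values. The file names are placeholders, the values 3, 5 and 7 are arbitrary, and the helper functions are hypothetical, not part of Kaiju. It assumes Kaiju's tab-separated output, where the first column is `C` (classified) or `U` (unclassified).

```python
import shlex


def kaiju_command(nodes, fmi, reads_1, reads_2, out, mismatches=3, threads=4):
    """Build a Kaiju command line for paired-end reads in Greedy mode."""
    return [
        "kaiju",
        "-t", nodes,            # NCBI taxonomy nodes.dmp
        "-f", fmi,              # Kaiju database index (.fmi)
        "-i", reads_1,          # first mates
        "-j", reads_2,          # second mates
        "-a", "greedy",         # run mode that permits mismatches
        "-e", str(mismatches),  # number of mismatches allowed
        "-z", str(threads),     # number of threads
        "-o", out,              # output file
    ]


def classified_fraction(lines):
    """Return the fraction of reads flagged 'C' in Kaiju output lines."""
    classified = total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        if line.split("\t", 1)[0] == "C":
            classified += 1
    return classified / total if total else 0.0


if __name__ == "__main__":
    # Placeholder file names: replace them with your own paths.
    for e in (3, 5, 7):
        cmd = kaiju_command(
            "nodes.dmp", "kaiju_db_viruses.fmi",
            "reads_1.fastq", "reads_2.fastq",
            f"kaiju_e{e}.out", mismatches=e,
        )
        print(shlex.join(cmd))
```

After running each printed command, something like `classified_fraction(open("kaiju_e5.out"))` would give the classification rate for that setting, so the runs can be compared side by side.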

The other possible reason for the low classification rate is the database used. Looking at the problem from that angle, would you recommend that I use another viral database?

Thanks for reading :)

Thinking aloud about possible explanations:

RefSeq likely has a good representation of known viral classes, so if your sequences are not being classified even at the top level, then either a) you have data on new viruses that are presently unknown (feasible but not certain), or b) your data is not of good quality and/or is not even from viruses. To test the second theory you could use a broad Kaiju database and see whether non-viral genomes are represented. How do you know that your data comes only from viruses? What was the experiment that got you here?
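To make the second test concrete, here is a rough Python sketch of how the output of a run against a broad database could be tallied by top-level group. It assumes Kaiju's tab-separated output (status, read name, taxon ID) and the NCBI `nodes.dmp` file, where the first two `|`-separated fields are the taxon ID and its parent. The function names are hypothetical helpers, not part of Kaiju.

```python
from collections import Counter

# NCBI taxonomy IDs of the top-level groups.
TOP_LEVEL = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota", 10239: "Viruses"}


def load_parents(nodes_lines):
    """Map each taxon ID to its parent ID from nodes.dmp lines."""
    parents = {}
    for line in nodes_lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parents[int(fields[0].strip())] = int(fields[1].strip())
    return parents


def top_level_group(taxid, parents):
    """Walk up the taxonomy until a top-level group is reached."""
    seen = set()
    while taxid not in seen:
        if taxid in TOP_LEVEL:
            return TOP_LEVEL[taxid]
        seen.add(taxid)
        if taxid not in parents:
            break
        taxid = parents[taxid]  # the root is its own parent, which ends the loop
    return "Other"


def tally_groups(kaiju_lines, parents):
    """Count reads per top-level group, plus unclassified reads."""
    counts = Counter()
    for line in kaiju_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        if fields[0] == "C":
            counts[top_level_group(int(fields[2]), parents)] += 1
        else:
            counts["Unclassified"] += 1
    return counts
```

If most classified reads end up under Bacteria or Eukaryota, that would point towards non-viral material in the sample rather than towards novel viruses.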

Hello, thanks for your answer :). These viral metagenomic datasets came from estuarine waters in the southern fjords of Chile. Since the RefSeq virus database has a good representation of known viruses, as you say, this could be a case of endemism, but a classification rate of 2.95% still seems strikingly low. Should I allow more mismatches? Maybe I will try this program:

Thanks for helping on this discussion :)


