Hello,
I am working with shotgun metagenomic data from rhizospheric soil samples. I preprocessed the data by removing low-quality reads and adapters. I also removed human genome contamination (only ~0.08% of reads were filtered out).
For taxonomic classification, I used Kraken2 with a custom database that I built from all NCBI organisms. The database size was ~1.5 TB, so I expected it to be quite comprehensive.
After running Kraken2 on the preprocessed and human-filtered reads, I observed that only ~18% of the reads were classified, while ~82% remained unclassified.
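In case it helps, the classified/unclassified split can be double-checked straight from the C/U flags in the per-read Kraken2 output; a minimal sketch, with a placeholder file name standing in for whatever was passed to `--output`:

```python
# Count classified vs. unclassified reads from a Kraken2 per-read output file.
# "kraken2_output.txt" is a placeholder; column 1 is "C" (classified) or "U" (unclassified).
from collections import Counter

counts = Counter()
with open("kraken2_output.txt") as handle:
    for line in handle:
        if not line.strip():
            continue
        status = line.split("\t", 1)[0]   # first tab-separated field: C or U
        counts[status] += 1

total = counts["C"] + counts["U"]
print(f"classified:   {counts['C']} ({100 * counts['C'] / total:.1f}%)")
print(f"unclassified: {counts['U']} ({100 * counts['U'] / total:.1f}%)")
```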
My questions are:
Is it normal to get such a low percentage of classified reads in soil metagenomic data?
Could there be an issue with my database construction or the way I ran Kraken2?
What are the possible reasons why ~80% of my reads remain unclassified despite using a large, comprehensive database?
Any advice, possible explanations, or shared experiences with similar soil metagenomic datasets would be greatly appreciated.
Thanks!
Hard to tell, since we don't know how you actually constructed your database or what's actually in it. I would suggest trying the standard database Kraken2 is distributed with. If the classification rate is significantly higher, that suggests something went wrong during your database creation.
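If you do rerun against the standard database, the two runs are easy to compare from the reports: taxid 0 is the unclassified pseudo-entry and taxid 1 (root) covers all classified reads. A rough sketch, assuming the default 6-column report format and using placeholder report file names:

```python
# Compare classification rates from two Kraken2 reports (custom vs. standard DB).
# Report columns: percent, clade reads, direct reads, rank code, taxid, name.
def classified_fraction(report_path):
    counts = {}
    with open(report_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            taxid, clade_reads = fields[4], int(fields[1].strip())
            if taxid == "0":        # "unclassified" pseudo-taxon
                counts["unclassified"] = clade_reads
            elif taxid == "1":      # root: sum of all classified reads
                counts["classified"] = clade_reads
    total = counts.get("classified", 0) + counts.get("unclassified", 0)
    return counts.get("classified", 0) / total if total else 0.0

# Placeholder file names for the two runs.
for label, path in [("custom DB", "custom_db.report"), ("standard DB", "standard_db.report")]:
    print(f"{label}: {100 * classified_fraction(path):.1f}% classified")
```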
Have you taken some of those unclassified reads and done some blast+ searches (or used the web BLAST interface) to see if they return logical hits? Also, what sequences did you use to build the database? Genomes from RefSeq, nt/nr, or something else?
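If you rerun Kraken2 with `--unclassified-out` (or otherwise collect the reads flagged U), something like the sketch below will pull a small random subset into a FASTA file for a web BLAST sanity check; the file names are placeholders:

```python
# Reservoir-sample 50 reads from a FASTQ of unclassified reads ("unclassified.fq",
# placeholder name, e.g. from kraken2 --unclassified-out) and write them as FASTA.
import random

SAMPLE_SIZE = 50
sample = []

with open("unclassified.fq") as handle:
    # zip over the same handle yields consecutive 4-line FASTQ records
    for i, record in enumerate(zip(handle, handle, handle, handle)):
        if i < SAMPLE_SIZE:
            sample.append(record)
        else:
            j = random.randrange(i + 1)
            if j < SAMPLE_SIZE:
                sample[j] = record

with open("blast_sample.fasta", "w") as out:
    for header, seq, _plus, _qual in sample:
        out.write(">" + header[1:])   # keep the read ID, swap "@" for ">"
        out.write(seq)
```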