Hello,
I am working with shotgun metagenomic data from rhizospheric soil samples. I preprocessed the data by removing low-quality reads and adapters. I also removed human genome contamination (only ~0.08% of reads were filtered out).
For taxonomic classification, I used Kraken2 with a custom database that I built from all NCBI organisms. The database size was ~1.5 TB, so I expected it to be quite comprehensive.
After running Kraken2 on the preprocessed and human-filtered reads, I observed that only ~18% of the reads were classified, while ~82% remained unclassified.
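The classified/unclassified split can be re-derived from the per-read output file, which is a useful sanity check on the summary Kraken2 prints. The following is a minimal sketch, assuming the standard tab-delimited Kraken2 output in which the first column is C (classified) or U (unclassified). The function name and the example records are hypothetical, not taken from the actual run.

```python
from collections import Counter


def classification_summary(kraken_output_lines):
    """Count classified (C) and unclassified (U) records in Kraken2 per-read output."""
    counts = Counter()
    for line in kraken_output_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # First tab-delimited field is the classification status.
        status = line.split("\t", 1)[0]
        if status in ("C", "U"):
            counts[status] += 1
    total = counts["C"] + counts["U"]
    percent = 100.0 * counts["C"] / total if total else 0.0
    return {
        "classified": counts["C"],
        "unclassified": counts["U"],
        "percent_classified": percent,
    }


# Made-up records for illustration only (status, read ID, taxid, length, LCA mapping).
example = [
    "C\tread1\t562\t150\t562:116",
    "U\tread2\t0\t150\t0:116",
    "U\tread3\t0\t150\t0:116",
    "U\tread4\t0\t150\t0:116",
]
summary = classification_summary(example)
print(summary)
```

For a real run, the same function could be fed an open file handle on the Kraken2 output file instead of the example list.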
My questions are:
Is it normal to get such a low percentage of classified reads in soil metagenomic data?
Could there be an issue with my database construction or the way I ran Kraken2?
What are the possible reasons why ~80% of my reads remain unclassified despite using a large, comprehensive database?
Any advice, possible explanations, or shared experiences with similar soil metagenomic datasets would be greatly appreciated.
Thanks!
Hard to tell, since we don't know how you actually constructed your database or what's actually in it. I would suggest trying the standard database Kraken2 is distributed with. If the classification rate is significantly higher, that suggests something went wrong during your database creation.
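One way to see what is actually in a database is to run kraken2-inspect on it and look at the top-level clades. The following is a rough sketch that parses that output, assuming the six tab-delimited columns described in the Kraken2 manual (percentage of minimizers, minimizers in clade, minimizers assigned directly, rank code, taxid, name) and comment lines starting with #. The function name is hypothetical and the numbers in the example are made up.

```python
def top_level_clades(inspect_lines, ranks=("D", "K")):
    """Return (name, rank, taxid, clade_minimizers, percent) for domain/kingdom rows."""
    rows = []
    for line in inspect_lines:
        # Skip header/comment lines and blanks.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        percent, clade_minimizers, _direct, rank, taxid, name = fields[:6]
        if rank in ranks:
            rows.append(
                (name.strip(), rank, int(taxid), int(clade_minimizers), float(percent))
            )
    return rows


# Made-up inspect output for illustration only.
example_inspect = [
    "# Database options: nucleotide db, k = 35, l = 31",
    "100.00\t1000\t0\tR\t1\troot",
    " 60.00\t600\t0\tD\t2\t    Bacteria",
    " 30.00\t300\t0\tD\t2759\t    Eukaryota",
    "  5.00\t50\t0\tK\t4751\t      Fungi",
]
clades = top_level_clades(example_inspect)
for name, rank, taxid, minimizers, percent in clades:
    print(f"{name}\t{rank}\t{taxid}\t{minimizers}\t{percent:.2f}%")
```

If an expected group such as Fungi is missing or tiny in the real output, that would point to a problem with the database build rather than with the samples.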
Thanks for your response. For building my custom Kraken2 database, I downloaded the complete nt.gz file from NCBI using the following command: wget -c https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz -P /data/
Then I used this file to construct the database.
The reason I didn’t use the standard Kraken2 database is that my focus is on bacteria, fungi, archaea, and viruses. The standard database (built from RefSeq) only includes RefSeq archaea, bacteria, viral, plasmid, human, and UniVec_Core, but it does not include fungi, which are important for my study. That’s why I opted to build a custom database from nt instead.
Do you think that if I build a custom database restricted to only bacteria, fungi, archaea, and viruses, the percentage of classified reads might increase compared to using the full nt?
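To test this empirically, a restricted database could be built with kraken2-build and run on the same reads for comparison. The sketch below only assembles and prints the commands; it does not run them. The library names (archaea, bacteria, viral, fungi) are ones kraken2-build accepts for --download-library, while the database path /data/k2_custom and the thread count are placeholders.

```python
import shlex


def restricted_db_commands(db_dir, libraries=("archaea", "bacteria", "viral", "fungi"),
                           threads=8):
    """Assemble kraken2-build commands for a database limited to the given libraries."""
    commands = [["kraken2-build", "--download-taxonomy", "--db", db_dir]]
    for library in libraries:
        commands.append(
            ["kraken2-build", "--download-library", library, "--db", db_dir]
        )
    commands.append(
        ["kraken2-build", "--build", "--threads", str(threads), "--db", db_dir]
    )
    return commands


# /data/k2_custom is a hypothetical path; adjust to your own storage.
commands = restricted_db_commands("/data/k2_custom")
for command in commands:
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the lists can be passed to subprocess.run once the paths are confirmed.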
To note: the files in the https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA directory are over a year old at this point. They are no longer being made available for the current nt database. If you need the latest version, you will need to get the pre-formatted nt BLAST indexes and dump out the FASTA sequences.
Before you do that, it may be easier to try this: the Kraken2 developers make a pre-formatted core_nt version available at https://benlangmead.github.io/aws-indexes/k2 (more info about core_nt is here --> https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/ ), which you could try out in case there was a problem with the database you built.
Have you taken some of those reads and done some blast+ searches via the web interface to see if they return logical hits?
What sequences did you use? Genomes from RefSeq, nt/nr, or something else?
Thank you for your response, sir. I will try running BLAST on some of the unclassified reads as you suggested and check what hits they return.
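Pulling a small random sample of unclassified reads into FASTA makes it easy to paste them into the BLAST web interface. The following is a minimal sketch, assuming the standard Kraken2 per-read output (status in column 1, read ID in column 2) and unwrapped four-line FASTQ records whose header IDs match the IDs Kraken2 reports. The function names, seed, and example reads are hypothetical.

```python
import random


def unclassified_ids(kraken_output_lines):
    """Collect read IDs whose Kraken2 status column is 'U'."""
    ids = []
    for line in kraken_output_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "U":
            ids.append(fields[1])
    return ids


def fastq_to_fasta_subset(fastq_lines, wanted_ids):
    """Return FASTA text for the FASTQ records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    records = []
    lines = iter(fastq_lines)
    for header in lines:
        sequence = next(lines, "").rstrip("\n")
        next(lines, "")  # '+' separator line
        next(lines, "")  # quality line
        header = header.rstrip("\n")
        if not header.startswith("@"):
            continue
        # The read ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0]
        if read_id in wanted:
            records.append(f">{read_id}\n{sequence}")
    return "\n".join(records)


# Made-up data for illustration only.
kraken_lines = [
    "C\tread1\t562\t150\t562:116",
    "U\tread2\t0\t150\t0:116",
    "U\tread3\t0\t150\t0:116",
]
fastq_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2 extra", "GGCC", "+", "IIII",
    "@read3", "TTAA", "+", "IIII",
]
ids = unclassified_ids(kraken_lines)
rng = random.Random(42)  # fixed seed so the sample is reproducible
picked = rng.sample(ids, min(2, len(ids)))
fasta = fastq_to_fasta_subset(fastq_lines, picked)
print(fasta)
```

On real data, the sample size would typically be a few dozen reads, and the FASTQ lines would come from an open (possibly gzip-opened) file handle.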