Low percentage of classified reads (~18%) after Kraken2 analysis – is this expected?

Hello,

I am working with shotgun metagenomic data from rhizospheric soil samples. I preprocessed the data by removing low-quality reads and adapters. I also removed human genome contamination (only ~0.08% of reads were filtered out).

For taxonomic classification, I used Kraken2 with a custom database that I built from all NCBI organisms. The database size was ~1.5 TB, so I expected it to be quite comprehensive.

After running Kraken2 on the preprocessed and human-filtered reads, I observed that only ~18% of the reads were classified, while ~82% remained unclassified.

My questions are:

Is it normal to get such a low percentage of classified reads in soil metagenomic data?

Could there be an issue with my database construction or the way I ran Kraken2?

What are the possible reasons why ~80% of my reads remain unclassified despite using a large, comprehensive database?

Any advice, possible explanations, or shared experiences with similar soil metagenomic datasets would be greatly appreciated.

Thanks!


Hard to tell, since we don't know how you actually constructed your database or what's in it. I would suggest trying the standard database that Kraken2 is distributed with. If that gives a significantly higher classification rate, it suggests something went wrong during your database creation.
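
To make that comparison concrete, here is a minimal sketch that computes the classified fraction from Kraken2's per-read output, where the first tab-separated column is `C` (classified) or `U` (unclassified). The function name and the file names in the usage note are hypothetical.

```python
from typing import Iterable, Tuple


def classified_fraction(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Count classified (C) and unclassified (U) reads in Kraken2 per-read output.

    Returns (classified, unclassified, classified / total).
    """
    classified = 0
    unclassified = 0
    for line in lines:
        # The first tab-separated field is the classification status.
        status = line.split("\t", 1)[0].strip()
        if status == "C":
            classified += 1
        elif status == "U":
            unclassified += 1
    total = classified + unclassified
    fraction = classified / total if total else 0.0
    return classified, unclassified, fraction
```

Running it on the output of each database (for example `with open("custom_db.kraken") as fh: print(classified_fraction(fh))`, then the same for a hypothetical `standard_db.kraken`) gives two fractions that can be compared directly.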


Thanks for your response. For building my custom Kraken2 database, I downloaded the complete nt.gz file from NCBI using the following command:

wget -c https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz -P /data/

Then I used this file to construct the database.

The reason I didn’t use the standard Kraken2 database is that my focus is on bacteria, fungi, archaea, and viruses. The standard database (built from RefSeq) only includes RefSeq archaea, bacteria, viral, plasmid, human, and UniVec_Core, but it does not include fungi, which are important for my study. That’s why I opted to build a custom database from nt instead.

Do you think that if I build a custom database restricted to only bacteria, fungi, archaea, and viruses, the percentage of classified reads might increase compared to using the full nt?
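
For reference, a restricted build of this kind could be sketched as follows. It assumes `kraken2-build` is on the PATH and uses its `--download-taxonomy`, `--download-library`, and `--build` options; the database directory and thread count are hypothetical. Note that these libraries come from RefSeq genomes, not from nt, so this is not equivalent to filtering nt by taxon.

```python
import subprocess
from typing import List

# Library names passed to `kraken2-build --download-library`.
LIBRARIES = ["archaea", "bacteria", "viral", "fungi"]


def build_commands(db_dir: str, threads: int = 8) -> List[List[str]]:
    """Return the kraken2-build invocations for a restricted custom database."""
    cmds = [["kraken2-build", "--download-taxonomy", "--db", db_dir]]
    for lib in LIBRARIES:
        cmds.append(["kraken2-build", "--download-library", lib, "--db", db_dir])
    cmds.append(
        ["kraken2-build", "--build", "--db", db_dir, "--threads", str(threads)]
    )
    return cmds


def run_all(cmds: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands("/data/k2_custom"))` (hypothetical path) would execute the steps in order; printing the list first is a cheap way to check them before starting a long download.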


Note: the files in the https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA directory are over a year old at this point, and current versions of the nt database are no longer made available there. If you need the latest version, you will need to get the pre-formatted nt BLAST indexes and dump the FASTA sequences out of them.
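
A rough sketch of that route, assuming the BLAST+ tools `update_blastdb.pl` and `blastdbcmd` are installed; the function names and output path are hypothetical, and the full nt dump is very large, so check disk space first.

```python
import subprocess
from typing import List


def download_command(db_name: str = "nt") -> List[str]:
    """Command to fetch and unpack the pre-formatted BLAST database volumes."""
    return ["update_blastdb.pl", "--decompress", db_name]


def dump_fasta_command(db_name: str, out_fasta: str) -> List[str]:
    """Command to write every sequence in a BLAST database as FASTA."""
    return [
        "blastdbcmd",
        "-db", db_name,
        "-entry", "all",   # all sequences in the database
        "-outfmt", "%f",   # FASTA output
        "-out", out_fasta,
    ]


def run(cmd: List[str]) -> None:
    """Run one command and raise if it exits with an error."""
    subprocess.run(cmd, check=True)
```

Typical use would be `run(download_command())` followed by `run(dump_fasta_command("nt", "nt.fasta"))`, executed in the directory that holds the downloaded volumes.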

"custom database restricted to only bacteria, fungi, archaea, and viruses"

Before you do that, it may be easier to try this.

The Kraken2 developers make a pre-built core_nt database available at https://benlangmead.github.io/aws-indexes/k2 (more info about core_nt is here: https://ncbiinsights.ncbi.nlm.nih.gov/2024/07/18/new-blast-core-nucleotide-database/ ). You could try it out, in case there was a problem with the database you built.


"why ~80% of my reads remain unclassified"

Have you taken some of those reads and run BLAST searches via the web interface to see whether they return sensible hits?
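
One way to pull out a handful of reads for this is sketched below. It assumes the unclassified reads were written to a FASTQ file (Kraken2 can do this with `--unclassified-out`), draws a random sample with reservoir sampling, and formats it as FASTA for pasting into the web form. All function and file names are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read_id, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (read_id, sequence) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        next(it, "")  # quality line
        yield header[1:].split()[0], seq


def sample_reads(records: Iterable[Record], n: int, seed: int = 0) -> List[Record]:
    """Reservoir-sample up to n records in a single pass over the input."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records)
```

For example, `with open("unclassified.fq") as fh: print(to_fasta(sample_reads(read_fastq(fh), 20)))` would print 20 reads ready to paste into BLAST. If most of them have no good hit even against nt, the low classification rate likely reflects organisms missing from the reference rather than a database construction error.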

"I used Kraken2 with a custom database that I built from all NCBI organisms."

What sequences did you use? Genomes from RefSeq, nt/nr, or something else?


Thank you for your response, sir. I will try running BLAST on some of the unclassified reads as you suggested and check what hits they return.
