kraken2 different bacteria read counts on custom database
3.2 years ago

Hi,

Using kraken2, I ran two classifications on the same sample: one with the kraken2 standard database, which includes Homo sapiens, and the other with a custom database built with kraken2 that doesn't contain Homo sapiens. Of the 29 million reads, 16k are assigned to bacteria with the standard database (with HS). With the custom database without HS, 1.06 million reads are assigned to bacteria.

My question is: which result should I believe? There is clearly human contamination in the sample, but when I leave human out of the classification I get many more bacterial reads, and much more diversity too. I am tempted to put my money on the classification using both bacteria and human, since to me the read count difference must come from some sequence homology between human and bacteria, where some reads are assigned to human when both targets are available.

What do you think? Does my impression fit with kraken's internal alignment algorithm?

Thanks!

Phil

metagenomics kraken2 dna-seq • 2.5k views

I'm curious what happens if you remove the mitochondrial DNA from the reference and re-run. I had a similar problem, which I solved; see here: Kraken2 database curation. It might not be a problem with human, though (except for the mitochondria).


Thanks for the info. I did try with a new database that does not contain human mitochondrial DNA, but the count doesn't change much ...

2.8 years ago
ilyzdd ▴ 10

Hi,

Have you decontaminated the raw reads before running Kraken2, for example by using Bowtie2 or BWA to map all the reads to the human reference genome and excluding every read that maps? If the sample is human stool, this leaves fewer human reads in the data.
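
A rough sketch of what that host-removal step could look like with Bowtie2, written as a small Python helper that only builds the command line. The index prefix, file names and thread count are placeholders I made up; `--un-conc-gz` writes the pairs that did not align concordantly to the human index, and Bowtie2 replaces the `%` in the path with 1 or 2.

```python
import shlex

def host_removal_cmd(index_prefix, r1, r2, out_prefix, threads=4):
    """Build a bowtie2 command that keeps read pairs NOT aligning to the host.

    All paths are hypothetical placeholders; adjust them to your own data.
    """
    return [
        "bowtie2",
        "-p", str(threads),          # number of alignment threads
        "-x", index_prefix,          # prefix of a bowtie2 index of the human genome
        "-1", r1,                    # forward reads
        "-2", r2,                    # reverse reads
        # pairs that fail to align concordantly; '%' becomes 1 or 2
        "--un-conc-gz", out_prefix + "_R%.fastq.gz",
        "-S", "/dev/null",           # discard the SAM output, only the filtered reads matter
    ]

cmd = host_removal_cmd("GRCh38_index", "sample_R1.fastq.gz",
                       "sample_R2.fastq.gz", "sample_nohuman")
print(shlex.join(cmd))  # inspect first, then run with subprocess.run(cmd, check=True)
```

The filtered files would then go into Kraken2 in place of the raw reads. Note that this is only one way to define "human" reads; a stricter filter would also drop pairs where a single mate aligns.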

3.2 years ago
ctseto ▴ 310

If you like reading kraken --output files, for each contig you might have Bacteria:1 9606:12 0:1000 (where 0 is unclassified). Eliminate the host 9606 and it turns into Bacteria:1 0:1012, or more generally Bacteria:N 0:1000+(12-N), and the vote switches to Bacteria.

I suspect one needs a human "sink" so that human k-mers have a place to go, rather than traversing the LCA and ending up somewhere they shouldn't be. However, I find it hard to believe that the difference is 16k vs 1.06M bacterial reads with and without human.

In the end, check the first few lines of your kraken.out from both databases and see how the kmer assignments look.
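
To make that check less tedious, here is a minimal sketch that sums the k-mer counts per taxid from the last column of a kraken2 output line. It only tallies the `taxid:count` tokens; it does not reimplement how kraken2 picks the final call. The function name is mine, and the example string is the hypothetical one from above with Bacteria written as taxid 2.

```python
from collections import Counter

def tally_kmers(lca_field):
    """Sum k-mer counts per taxid from a kraken2 LCA mapping string."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":           # separator between the two mates of a pair
            continue
        taxid, _, n = token.rpartition(":")
        counts[taxid] += int(n)      # taxid "0" means the k-mer is not in the database
    return counts

# Hypothetical read from the example above (2 = Bacteria, 9606 = Homo sapiens)
tally = tally_kmers("2:1 9606:12 0:1000")
for taxid, n in tally.most_common():
    print(taxid, n)
```

Running this over the same read from both databases shows directly which k-mers moved from 9606 to 0 or to a bacterial taxid.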


Looking at the output files I see things like this :

from database with human:

C   NB502083:48:HKTMTAFXY:1:11101:19388:1052    Homo sapiens (taxid 9606)   76|76   9606:3 131567:5 9606:1 131567:1 9606:5 131567:3 9606:24 |:| 9606:21 2759:5 9606:5 2759:6 9606:5


from database without human

C   NB502083:48:HKTMTAFXY:1:11101:19388:1052    1280    76|76   0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24 |:| 0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24


Taxon 1280 is Staphylococcus aureus, but many k-mers are not in the database (the '0:' entries). Taking a closer look at the output, I see that for a pair to be unclassified, both reads must be completely absent from the k-mer database. From this observation, I guess one is better off with the most complete k-mer database.


My interpretation here is that the second db, without human, classifies the human k-mers as "0" (unclassified). Where the first database says 131567 (cellular organisms), the second says 1280, at least for read 1. In read 2 it is either 9606 or 2759 with human in the database, and 0 or 1280 in your database sans human. In read 2 the first 21 k-mers are human; without human in the db it is a mix of 0 and 1280 and ends with a bunch of unknowns.

In this case I would probably lean towards your first database.
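
If you want to see how widespread this switch is, a sketch along these lines could count, over all reads, which call in the first run became which call in the second. It assumes the standard tab-separated kraken2 output (status, read ID, taxon, length, k-mer mapping) and handles the `name (taxid N)` form. The function names are mine, and the two example lines are the ones posted above.

```python
import re
from collections import Counter

TAXID_IN_NAME = re.compile(r"\(taxid (\d+)\)$")

def parse_line(line):
    """Return (read_id, taxid) from one tab-separated kraken2 output line."""
    fields = line.rstrip("\n").split("\t")
    read_id, taxon = fields[1], fields[2]
    match = TAXID_IN_NAME.search(taxon)   # 'Homo sapiens (taxid 9606)' style names
    return read_id, (match.group(1) if match else taxon)

def compare_runs(lines_a, lines_b):
    """Count (taxid in run A, taxid in run B) pairs for reads present in both runs."""
    calls_a = dict(parse_line(line) for line in lines_a)
    changes = Counter()
    for line in lines_b:
        read_id, taxid_b = parse_line(line)
        if read_id in calls_a:
            changes[(calls_a[read_id], taxid_b)] += 1
    return changes

read = "NB502083:48:HKTMTAFXY:1:11101:19388:1052"
with_human = "\t".join(["C", read, "Homo sapiens (taxid 9606)", "76|76",
    "9606:3 131567:5 9606:1 131567:1 9606:5 131567:3 9606:24 |:| 9606:21 2759:5 9606:5 2759:6 9606:5"])
without_human = "\t".join(["C", read, "1280", "76|76",
    "0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24 |:| 0:3 1280:5 0:1 1280:1 0:5 1280:3 0:24"])

changes = compare_runs([with_human], [without_human])
print(changes)
```

On real data you would pass the two open output files instead of the one-line lists. A large count for pairs like ("9606", some bacterial taxid) would suggest that most of the extra bacterial reads are human reads rescued by a handful of shared k-mers.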
