Question

Kraken2 Custom Database non-deterministic results

5 months ago

Bjorn • 0

Hello,

I have two custom databases: one containing a single phage, and one containing that same phage plus a bacterium. I ran kraken2 on the same data against each database, but got a different number of reads classified as that phage each time: roughly 200 with the phage-only database versus 21 with the phage-plus-bacterium database.

Why is kraken2 not deterministic? Is this a normal occurrence?

I have seen that settings such as --minimum-hit-groups 4 and --confidence 0.05 may improve the consistency of the output. However, I am having a hard time understanding how the algorithm could work and still not give consistent results. Could someone explain this?

Thanks!

Metagenomics kraken2 • 583 views

updated 5 months ago by colindaven 6.9k • written 5 months ago by Bjorn • 0

Many (most?) NGS data analysis programs produce non-deterministic output unless they explicitly offer a deterministic mode or a way to provide a seed. In general this comes from parallel processing, stochastic algorithms, data handling (multiple I/O streams), and differences in hardware/software (the last does not apply in your case).

5 months ago by GenoMax 147k

Thank you!

5 months ago by Bjorn • 0

score 0 · Answer 1 · 2024-05-22

You're using two different databases, so the results are expected to differ. This has nothing to do with deterministic versus stochastic behaviour.

The tool would be stochastic if you ran the same reads twice against the same database and got different results. Try that: the results should be identical, and in my experience of that tool they are.
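To make that check concrete, here is a minimal Python sketch that compares the per-read calls from two runs. It assumes the standard tab-delimited per-read output written via --output, where the first three columns are the C/U status, the read ID, and the assigned taxon ID. The function names and the example lines are made up for illustration and are not real results.

```python
import io


def load_calls(handle):
    """Return {read_id: (status, taxid)} from Kraken2-style per-read output."""
    calls = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        calls[read_id] = (status, taxid)
    return calls


def diff_calls(first, second):
    """List read IDs whose call differs, or that appear in only one run."""
    all_ids = first.keys() | second.keys()
    return sorted(r for r in all_ids if first.get(r) != second.get(r))


# Made-up example lines standing in for three output files (assumption).
run_a = "C\tread1\t10\t150\t10:116\nU\tread2\t0\t150\t0:116\n"
run_b = "C\tread1\t10\t150\t10:116\nU\tread2\t0\t150\t0:116\n"
run_c = "C\tread1\t1\t150\t1:116\nU\tread2\t0\t150\t0:116\n"

same = diff_calls(load_calls(io.StringIO(run_a)), load_calls(io.StringIO(run_b)))
changed = diff_calls(load_calls(io.StringIO(run_a)), load_calls(io.StringIO(run_c)))
print(same)     # []
print(changed)  # ['read1']
```

With real files, replace the io.StringIO objects with open() on the two output files. The comparison is keyed on read ID, so the order of lines in each file does not matter.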

What you are actually discussing is how database composition affects results in metagenomics (probably using short reads). This is a hugely important problem and has been widely discussed in the literature (it is termed the unknown species problem, and it causes mismappings for alignment-based tools). Basically, many algorithms falsely attribute reads to related species / references / nucleotide sequences if the actual species is missing from the database.

There is a lot of work on assessing mock communities of known composition to test for these accuracy problems. Essentially, the taxonomic signal in a short read or read pair is not always sufficient for a unique alignment. Sadly, long reads do not alleviate this problem completely, but they do help.
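One way database composition can change the counts is through k-mers shared between references. The toy Python sketch below is a deliberately simplified illustration, not Kraken2's actual implementation (which uses minimizers, a compact hash table, and a full taxonomy). It assumes a two-leaf taxonomy with hypothetical taxon IDs, invented sequences, and k=4. Each k-mer is stored under the lowest common ancestor of the genomes containing it, so a k-mer present in both the phage and the bacterium is stored at the root.

```python
ROOT, PHAGE, BACTERIUM = 1, 10, 20  # hypothetical taxon IDs


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_db(genomes, k):
    """Map each k-mer to a taxon; k-mers shared between taxa go to ROOT."""
    db = {}
    for taxid, seq in genomes.items():
        for kmer in kmers(seq, k):
            if kmer in db and db[kmer] != taxid:
                db[kmer] = ROOT  # lowest common ancestor in this toy taxonomy
            else:
                db[kmer] = taxid
    return db


def classify(read, db, k):
    """Assign a read to the leaf with the most hits, ROOT on ties, None if no hits."""
    hits = {}
    for i in range(len(read) - k + 1):
        taxid = db.get(read[i:i + k])
        if taxid is not None:
            hits[taxid] = hits.get(taxid, 0) + 1
    if not hits:
        return None
    leaves = {t: n for t, n in hits.items() if t != ROOT}
    if not leaves:
        return ROOT  # only shared k-mers matched
    best = max(leaves.values())
    winners = [t for t, n in leaves.items() if n == best]
    return winners[0] if len(winners) == 1 else ROOT


K = 4
phage = "ACGTTGCAAGGCTTAC"        # invented sequence
bacterium = "GGGCCCTTGCAAGGATAT"  # invented; shares "TTGCAAGG" with the phage

phage_only = build_db({PHAGE: phage}, K)
combined = build_db({PHAGE: phage, BACTERIUM: bacterium}, K)

shared_read = "TTGCAAGG"  # lies entirely in the shared region
unique_read = "ACGTTGCA"  # overlaps phage-specific sequence

print(classify(shared_read, phage_only, K))  # 10
print(classify(shared_read, combined, K))    # 1
print(classify(unique_read, combined, K))    # 10
```

In this toy setup the same read is labelled as phage against the phage-only database and as root against the combined database, so the phage count drops even though nothing about the run is random.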