Question: Kraken2 vs Minimap2 and Blast results seem to be incongruent
1
gravatar for cecilio11
9 days ago by
cecilio1130
cecilio1130 wrote:

Hello,

I obtained incongruent "classification" results when running Kraken2 vs. Minimap2 and Blast on the same data sets.

I am planning to assemble 5 bacterial genomes from pacbio reads. For 2 of the bacteria I generated two raw assemblies by using Canu. I suspected that there may be some contamination in the reads, so for each bacterium, I evaluated the raw pacbio reads using minimap2. Additionally, for the 2 draft assemblies I also used NCBI Blast to compare them to whole bacterial genomes as well as to the bacterial 16s rRNA databases. I was advised to use Kraken for a more comprehensive evaluation of possible contaminants.

Comparing the results from Minimap2 and Blast vs. Kraken2 I noticed some incongruence between the matches they report.

For example, Minimap2 indicate that 18.72% of the raw sequences of my Sphingomonas sp. map to Ignatzschineria larvae; 55% map to Sphingomonas paucimobilis. But, Kraken2 does not match Sphingomonas sp. to Ignatzschineria larvae at all. However, Kraken2 does match Sphingomonas sp. to a Sphingomonas (Sphingomonas paucimobilis is not included in the library of ref seqs that kraken builds).

Minimap2 indicates that 78.49% of the raw sequences of my Kurthia sp. map to Moellerella wisconsensis. But, kraken2 does not match my Kurthia sp. to Moellerella wisconsensis at all.

For the raw sequences of my Ignatzschineria sp., Minimap2 indicates that 81% of the raw sequences map to Moellerella wisconsensis. About 20% of the raw seqs map to Providencia. Blast of the draft assembly of my Ignatzschineria sp. to whole genomes and to 16s indicate that my Ignatzschineria may, in fact, be Moellerella wisconsensis. However, kraken2 does not match my Ignatzschineria sp. to Moellerella wisconsensis at all.

I run kraken2 in an IBM cluster.

Can someone give me a hint on why I observe the incongruence between the results reported by these methods?

Best regards,

assembly genome • 143 views
ADD COMMENTlink modified 8 days ago • written 9 days ago by cecilio1130

h.mom and colindaven,

Thank you for your answer. See below.

ADD REPLYlink modified 8 days ago • written 8 days ago by cecilio1130
2
gravatar for h.mon
9 days ago by
h.mon22k
Brazil
h.mon22k wrote:

You are comparing three different programs using three different databases, it is very difficult to pinpoint the cause of the differences. In addition, you do not explain how did you build the Blast and Kraken databases - what are the sources?

In a previous answer, I should have explained why mapping with minimap2 to a limited number of references is a bad approach. You didn't state clearly, but I am assuming you mapped to each reference separately, when using minimap2. This is not a good approach because some reads may map to a given genome when it is the only reference, but these reads wouldn't map to this genome if another, more similar genome, were included in the reference. So, when you say:

Minimap2 indicates that 81% of the raw sequences map to Moellerella wisconsensis. About 20% of the raw seqs map to Providencia.

This may mean about 80% of the reads mapped to Moellerela, and the remaining 20% mapped to Providencia. But it could also mean 60% of the reads mapped to Moellerela, 20% mapped to Moellerela and Providencia, and 20% didn't map at all.

ADD COMMENTlink modified 8 days ago • written 9 days ago by h.mon22k
1

Just as an addon to the great answer above, the methods are dramatically different.

Minimap2 and BLAST are alignment based, whereas Kraken uses kmer matching (and kmers are highly variant with up to 10-15% errors in raw pacbio reads).

" Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm. "

But beyond this difference, the critical point in metagenomics and or contaminant assessment is the database behind this. As I understand it, you used different databases in each case, so results will naturally be dependent on this.

ADD REPLYlink written 8 days ago by colindaven930

h.mom and colindaven,

Thank you for your answer.

For the raw reads, I did run Minimap2 and kraken2 in an IBM cluster. And yes, I run Minimap2 with a single reference sequence per run per bacteria. The percentages refer to each single run. For the draft assemblies, I run Blast (bacterial genomes and 16s bacterial and archaea databases) at the NCBI website (online).

I understand that Kraken2 uses databases downloaded from NCBI, which I do not believe are much different from the databases used by BLAST. The "reference genomes" I used to run Minimap2 were downloaded from the NCBI database.

The difference might be due in part to methodology (as you mention above) and in part to the goal of the tool. Kraken2 is a (ultrafast) classifier. I think Kraken will assign a piece of DNA to a given taxon (the best match), even though, let's say only 30-40% of the k-mers match that taxon and 20% match to another taxon (and perhaps kraken cannot classify the rest of the strand). BLAST and Minimap will show you all the matches.

Thank you for your insights. They were very helpful.

Happy new year to you and all the Biostars.

Cecilio11

ADD REPLYlink modified 8 days ago • written 8 days ago by cecilio1130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1459 users visited in the last hour