Question

Blastn and Kraken2 short reads clssification

1

Entering edit mode

9 months ago

Giulia.cosenza ▴ 100

Hello everyone,

I'm reaching out to those familiar with nucleic sequence classification software such as Blastn and Kraken2.

After performing shotgun sequencing on DNA extracted from a soil sample, I attempted to classify the obtained reads using Kraken2 software. To do this, I created a database containing sequences of bacteria, viruses, archaea, protozoa, plants, fungi, and the human genome.

The classification went well, and I obtained my report with reads distributed with varying percentages among different organisms.

However, in order to ensure there are no false positives, especially since I'm working with short reads of 30-50 bp, I increased the confidence threshold of Kraken to 0.9.

The puzzling issue is that when I extract the sequences classified as plants with a confidence of 0.9 from the original filtered FASTQ file and run them through Blast, I get completely different results!

Some reads are classified as animals or more generally metazoa, which makes sense because my Kraken database does not contain animal sequences. It's expected that based on similarity, these reads might fall under eukaryotes and occasionally extend to plants.

However, many other reads are classified as bacteria by Blast, and I don't understand why Kraken doesn't produce the same result, especially considering that I significantly increased the confidence threshold.

My question is: Can I consider my plant classification reliable or not?

Thank you for your help!

taxonomy blast kraken • 1.8k views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 9 months ago by Giulia.cosenza ▴ 100

1

Entering edit mode

I'm not convinced that this is the cause but plant genomes often contain some bacterial genomes contamination in my experience.

ADD REPLY • link 9 months ago by Asaf 10k

0

Entering edit mode

and run them through Blast, I get completely different results!

What does this BLAST database contain and how are the results different? Are you seeing full length alignments (not short local HSP's) in your blast results?

ADD REPLY • link 9 months ago by GenoMax 141k

0

Entering edit mode

I'm using the blastn suit setting the nucleotide collection (nt) as database and 'Somewhat similar sequences (blastn)' as algorithm. If I change the algorithm to something more specific like 'Highly similar sequences (megablast)' I can't get any match. The most of the match reached with blastn are local and not full length. This seems even more concerning to me because I wonder then how Kraken's classification is possible if indeed my reads do not seem to map closely anywhere, to the point that Blast either doesn't find matches or only finds partial ones, and not even on the same genomes found by Kraken.

ADD REPLY • link 9 months ago by Giulia.cosenza ▴ 100

1

Entering edit mode

Kraken2 (k-mers) and BLAST are using different methods to produce their results. If the blast results are not full length (and the blast database does not contain the same source references) then you can't conclude much of anything. See this past thread for reference: Kraken2 vs Minimap2 and Blast results seem to be incongruent

Note: There is a premade NT database for kraken2 available (480GB) https://genome-idx.s3.amazonaws.com/kraken/k2_nt_20230502.tar.gz if you truly want to directly compare.

ADD REPLY • link 9 months ago by GenoMax 141k

score 2 · Answer 1 · 2023-07-05

2

Entering edit mode

9 months ago

colindaven 6.4k

Having spent a lot of time chasing false positives in my career, I don't think taxonomic assignment from a 30-50bp read is reliable. Why are your reads so short ?

100bp is problematic enough, even far longer nanopore reads are frequently mis-assigned, but 30bp .... that's a no-go. It's 2008 all over again :-)

Consider another tool, such as Centrifuge. Also, I would filter out any reads under 50 bp from your dataset if practical.

ADD COMMENT • link 9 months ago by colindaven 6.4k

1

Entering edit mode

My reads are that long because I work with high degradated samples.

I understand well the difficulty of classifying short sequences that can easily be ubiquitous and found in the genomes of different organisms. And it was exactly for this reason, I chose (perhaps mistakenly) to use Kraken. I imagined that its classification method would tend to advance towards more specific taxonomic subclasses, only for those sequences for which it actually found a good reason to place them in a deeper node of the tree. What I mean is that I expect that if a sequence is common to multiple organisms belonging to the Phylum Actinomycetoa, for example, Kraken2 should place it in the "Actinomycetota" node and not reach the genus "Mycobacterium." Similarly, if a sequence is common to multiple bacterial phyla, I would expect Kraken2 to classify it simply as "Bacteria." And so on.

If this is true, a sequence classified at the species level means that it can only be placed there. Therefore, I wonder how it is possible that the same sequence does not map to the same genome in BLASTn but elsewhere.

I make all these considerations based on the fact that Kraken2 also provides classifications like "cellular organism" or "unclassified." Therefore, if the short sequence were indeed too ubiquitous, I would expect to find it there.

I apologize if I sound presumptuous. I am simply trying to reason through these issues to gain a better understanding and explain my thought process for the analyses. I acknowledge that my reasoning could be flawed, and I seek your assistance in identifying any gaps or errors in my thinking.

ADD REPLY • link 9 months ago by Giulia.cosenza ▴ 100

1

Entering edit mode

Ah I see, yes, degraded samples are a big problem but it leaves you with these massive problems to deal with I'm afraid.

Your reasoning appears sound, but just consider this - the amount of taxonomic signal in a 30bp read is extremely limited. I remember it from the days of 36 bp Illumina reads - aligning reads from an isolate P. aeruginosa to all bacteria, you would get most reads mapping to P. aeruginosa, but then an ever-decreasing number aligned to it's relatives like P. fluorescens, and even unrelated organisms which share genomic islands and their cargo genes. That is, the taxonomic signal was not sufficient for a specific hit. This improved with paired end 100 and 150 bp and longer nanopore reads, but no technology always produces reads which are completely specific.

I would map reads from any apparently positive hits to the genome and check coverage goes over the entire chromosome (eg here: https://github.com/MHH-RCUG/nf_wochenende/wiki/Interpreting-Wochenende-output ) before declaring positive hits with these read lengths.

Also, try to use multiple tools. Blast is certainly not the best at attributing taxonomy.

ADD REPLY • link 9 months ago by colindaven 6.4k

0

Entering edit mode

I understand! I completely agree with what you said about the taxonomic signal not being particularly specific.

However, if my understanding of how Kraken works is correct, the classification of a P. aeruginosa read that also maps to P. fluorescens or even unrelated organisms should place it in a common node like "Gammaproteobacteria" or even just "Bacteria" with Kraken. On the other hand, if a P. aeruginosa read is classified by Kraken as exactly that, I would expect the sequence to be specific. This specific sequence, if run through Blastn, should return P. aeruginosa as a match. Am I correct?

Instead, I'm not finding any matches or even organisms that are very different from the Kraken match.

ADD REPLY • link 9 months ago by Giulia.cosenza ▴ 100

score 0 · Answer 2 · 2023-07-05

My take will be similar to colindaven but with additional steps. I also don't trust short-read classification, which is why I assemble and classify contings instead. There is usually no problem assembling mixed eukaryotic and prokaryotic reads, because they are very distinct. In my experience having bacterial reads does not severely affect a plant genome assembly, and vice versa. On the other hand, longer contigs have more signal in them than short reads, and it is usually very easy to separate the contaminated DNA by t-SNE, UMAP, or any other approach used for metagenomic binning.