Question: Interpretation of Kraken results
ropolocan wrote 22 months ago:

Dear Biostars community:

I have been using Kraken extensively for the characterization of microbiomes. My colleague and I have a bit of a disagreement on how to interpret the results from the Kraken reports.

For example, let’s imagine we have a subset of a Kraken report that looks like this:

1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes

The columns of the report, according to the Kraken manual, are:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A taxonomy rank code
5. NCBI taxonomy ID
6. Indented scientific name
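For anyone who wants to work with these columns programmatically, here is a minimal Python sketch that reads the three example rows into named fields. The names `KrakenRow` and `parse_report_line` are my own and are not part of Kraken. The parser splits on any whitespace, which handles the sample above but throws away the indentation of the name column.

```python
from typing import NamedTuple


class KrakenRow(NamedTuple):
    percent: float     # column 1: % of reads in the clade rooted at this taxon
    clade_reads: int   # column 2: reads in the clade rooted at this taxon
    direct_reads: int  # column 3: reads assigned directly to this taxon
    rank: str          # column 4: taxonomy rank code
    taxid: int         # column 5: NCBI taxonomy ID
    name: str          # column 6: scientific name (indentation is dropped here)


def parse_report_line(line: str) -> KrakenRow:
    # Split on any whitespace, at most 5 times, so multi-word names stay whole.
    pct, clade, direct, rank, taxid, name = line.strip().split(None, 5)
    return KrakenRow(float(pct), int(clade), int(direct), rank, int(taxid), name.strip())


sample = """\
1.93    104417  104105  P   1224    Proteobacteria
0.18    96419   1968    P   201174  Actinobacteria
0.17    80738   10469   P   1239    Firmicutes
"""

rows = [parse_report_line(l) for l in sample.splitlines() if l.strip()]
for row in rows:
    print(row.name, row.clade_reads, row.direct_reads)
```

If the indentation matters (for example, to rebuild the tree), splitting on tabs instead would be the safer choice.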

In this example subset, columns #2 and #3 give different answers as to which taxon has more reads (i.e. which one is more abundant). If I use column #2, I can say more Actinobacteria reads were detected than Firmicutes reads. However, if I look at column #3, the opposite is true.
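The disagreement is easy to reproduce. This is only a sketch: the tuples are copied by hand from the three example rows above, and the variable names are arbitrary.

```python
# (name, column 2: clade reads, column 3: direct reads), copied from the example rows
rows = [
    ("Proteobacteria", 104417, 104105),
    ("Actinobacteria", 96419, 1968),
    ("Firmicutes", 80738, 10469),
]

# Rank the taxa from most to fewest reads using each column in turn.
by_clade = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
by_direct = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]

print(by_clade)   # ['Proteobacteria', 'Actinobacteria', 'Firmicutes']
print(by_direct)  # ['Proteobacteria', 'Firmicutes', 'Actinobacteria']
```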

If one were to use read count as a proxy for abundance, which column of the Kraken report is more appropriate: column #2 or column #3? In my opinion column #2 is more appropriate, but my collaborator thinks it is #3. I prefer column #2 because it is the sum of the reads assigned directly to the taxon plus all the reads assigned to its descendants, i.e. the whole clade rooted at that taxon. I would be very interested in seeing what you think.


Column #2 should be used, in my opinion. Column #3 only counts the reads assigned directly to that taxon, i.e. the reads that could not be assigned to any lower taxonomic level.

Reply written 22 months ago by Asaf

I agree. This makes sense. Thanks, @Asaf.

Reply written 22 months ago by ropolocan
Joseph Hughes (Scotland, UK) wrote 22 months ago:

It makes much more sense to use the results from columns 1 and 2, as these represent all reads that are assigned to (for example) Proteobacteria and to any descendant taxa of Proteobacteria, e.g. Acidithiobacillia and Alphaproteobacteria. If using column 3, you would be looking only at reads that are assigned to Proteobacteria itself and not to any descendant node. This is equivalent to only counting reads with this type of assignment in NCBI (see ORGANISM).
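To make the relationship between the two columns concrete, here is a small Python sketch on a toy tree. The read counts are invented for illustration and do not come from any real report. The function name `clade_reads` is also my own. It shows column 2 as the direct count (column 3) of a taxon plus the clade counts of its children.

```python
# Hypothetical direct-assignment counts (column 3) for a toy tree; made-up numbers.
direct = {
    "Proteobacteria": 10,
    "Alphaproteobacteria": 40,
    "Rhizobiales": 25,
    "Acidithiobacillia": 5,
}

# Parent -> children relationships for the same toy tree.
children = {
    "Proteobacteria": ["Alphaproteobacteria", "Acidithiobacillia"],
    "Alphaproteobacteria": ["Rhizobiales"],
}


def clade_reads(taxon: str) -> int:
    # Column 2 = reads assigned directly to the taxon + reads in every descendant clade.
    return direct[taxon] + sum(clade_reads(child) for child in children.get(taxon, []))


print(clade_reads("Proteobacteria"))       # 80
print(clade_reads("Alphaproteobacteria"))  # 65
```

In this toy case only 10 of the 80 Proteobacteria reads would show up in column 3, which is why column 3 alone understates the phylum.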


Agreed. Thanks for your answer, @Joseph Hughes.

Reply written 22 months ago by ropolocan