Question

Taxonomic sequence classification for shotgun metagenomics, compatible with downstream statistical analyses


4.5 years ago

DriesB ▴ 110

Hello Biostars,

I've been given occasion to plunge into the metagenomics field. I'm working with shotgun sequencing data, so not targeted at 16S or another gene. From reading several articles, including this review by Siegwald et al., and the websites of QIIME and mothur, it seems that targeted (non-shotgun) analyses are the standard, but that I cannot use the same software to analyse shotgun sequencing data. ~~Q1: Is this correct?~~

To analyze shotgun sequencing, I've so far identified two tools: Kraken2 and MetaPhlAn2. I know that there are other tools available such as CLARK, but they do not seem to be significantly different from Kraken2.

After trying Kraken2, I found that it only seems to support export to the Krona visualization tool for downstream analysis. So I now intend to try MetaPhlAn2. Output from this tool should be compatible with the R packages phyloseq and microbiome, which seem to be standard tools for downstream statistical analyses of metagenomes. ~~Q2: Are the observations and imputations in this paragraph correct?~~ Q3: Should I indeed be aiming for compatibility with phyloseq and microbiome, or are there alternatives for downstream statistical analysis? Q4: Is it indeed not possible to do further statistical analysis on Kraken2's output?

Finally, the datasets which I'm starting my analyses from are not so large, so the speed advantage of Kraken2 compared to other tools does not matter much to me. I'm mostly interested in increasing the sensitivity of my analysis. Human reads have already been removed from the datasets (BAMs) by alignment, although Kraken2 could still identify a significant share.

Thank you for your time!

P.S. I've read this recent review by Ye et al., but it mostly discusses taxonomic classifiers performance, not the possibilities for downstream statistical analysis.

metagenomics shotgun sequencing Kraken2 phyloseq • 4.0k views

ADD COMMENT • link updated 3.2 years ago by obriek11 • 0 • written 4.5 years ago by DriesB ▴ 110


Your post is too long and has too many questions, so I will answer them superficially and partially; maybe someone will chime in with a more detailed answer. A better approach is to post more specific questions, which I believe will increase the odds of getting better answers.

Q1. So this means I will not be able to use all standard tools.

What are the standard tools you are referring to? Regular 16S pipelines output a table of taxonomic identifications and their counts. One can certainly coerce Kraken output to the same format, so you could use all the same tools. However, there are several analyses where it doesn't make sense to use these approaches, e.g., using PICRUSt (or similar methods) to make functional predictions from taxonomic distribution, because with shotgun data one can assemble, predict genes and make functional predictions from the annotated metagenome directly.
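As a rough illustration of what "coercing Kraken output" could look like, here is a minimal Python sketch that pulls species-level counts out of one report. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name); the function name `parse_kraken_report` and the example counts are made up for illustration and are not part of Kraken2.

```python
def parse_kraken_report(text, rank_code="S"):
    """Return {taxon name: clade read count} for one Kraken-style report.

    Assumes six tab-separated columns per line:
    percent, clade reads, direct reads, rank code, taxid, indented name.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, rank, name = fields[1], fields[3], fields[5]
        if rank.strip() == rank_code:
            # Names are indented with spaces to show tree depth; strip them.
            counts[name.strip()] = int(clade_reads)
    return counts


# Tiny hand-written report with invented counts, for illustration only.
example_report = (
    " 40.00\t40\t40\tU\t0\tunclassified\n"
    " 60.00\t60\t0\tR\t1\troot\n"
    " 35.00\t35\t0\tG\t561\t    Escherichia\n"
    " 35.00\t35\t35\tS\t562\t      Escherichia coli\n"
    " 25.00\t25\t25\tS\t1280\t      Staphylococcus aureus\n"
)

print(parse_kraken_report(example_report))
```

Running this on each sample's report gives one dictionary of counts per sample, which is essentially the same information a 16S pipeline would put in its taxon table.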

Q2: Are the observations and imputations in this paragraph correct?

I don't know.

Q3: Should I indeed be aiming for compatibility with phyloseq and microbiome?

Not necessarily, but phyloseq (which I know better) has outstanding documentation, so it is convenient to use.

Q4: Is it indeed not possible to do further statistical analysis on Kraken output?

Further than what? Just alpha- and beta-diversity?

ADD REPLY • link 4.5 years ago by h.mon 35k

0


Thank you h.mon! As far as I'm concerned, this could definitely be a stand-alone answer.

I've opted for a long forum post, because this means I will not have to explain the context multiple times. Also, I think all these questions belong together when trying to do a full statistical analysis on a shotgun sequencing metagenomic dataset. Based on your comment, however, I have made some edits to improve readability.

ADD REPLY • link 4.5 years ago by DriesB ▴ 110

0


A1. What are the standard tools you are referring to?

With these, I'm referring mostly to QIIME and mothur. If I'm not mistaken, these give BIOM files as output directly and are easy to couple to phyloseq and microbiome. When I started my literature search, I thought I might be able to use QIIME and mothur, but these are aimed at the analysis of target genes (López-Garcia et al., SEQanswers).

A1. However, there are several analyses where it doesn't make sense to use these approaches, e.g., using PICRUSt (or similar methods) to make functional predictions from taxonomic distribution

No, this is not what I'm looking into, as this post revolves around taxonomic sequence classification.

A4. Further than what? Just alpha- and beta-diversity?

Yes indeed, but I have not found standard methods for calculating these scores from Kraken2 output. The kraken-biom package might be suitable for linking to downstream statistical analysis. I find it surprising, however, that neither Kraken nor phyloseq provides this conversion itself.
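For reference, both kinds of score can be computed from nothing more than per-sample taxon counts. The sketch below is a plain-Python illustration of the Shannon index (alpha-diversity, natural log) and Bray-Curtis dissimilarity (beta-diversity). The function names and the two toy samples are invented for this example, it assumes each sample has at least one read, and it is not a substitute for the tested implementations in R packages such as vegan or phyloseq.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log(n / total) for n in counts.values() if n > 0
    )


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


# Two toy samples with invented counts, for illustration only.
sample_a = {"Escherichia coli": 50, "Staphylococcus aureus": 50}
sample_b = {"Staphylococcus aureus": 50, "Bacillus subtilis": 50}

print(shannon_index(sample_a))
print(bray_curtis(sample_a, sample_b))
```

Note that raw counts from samples with very different sequencing depths are usually normalized or rarefied before such comparisons; that step is left out here.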

ADD REPLY • link 4.5 years ago by DriesB ▴ 110

0


EDIT: Deleted unnecessary information, struck out unclear questions. Reintroduced whitespace to increase readability.

ADD REPLY • link 4.5 years ago by DriesB ▴ 110

score 4 · Answer 1 · 2019-10-25

If you have 16S sequence data then QIIME or mothur are the most popular options. For shotgun sequencing data there are several options as well, and I understand that it can get pretty overwhelming trying to choose the right tool for your project.

I don't have any experience with Kraken2 but according to Kraken2's documentation it outputs abundance counts for each taxon it identifies. This kind of output is pretty standard among metagenomic tools and a lot of downstream metagenomic analysis packages like vegan, phyloseq, metagenomeSeq etc. can work with the taxon or OTU abundance matrices.
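To make the "abundance matrix" idea concrete, here is a small Python sketch that merges per-sample taxon counts into a taxa-by-samples table and writes it as tab-separated text, which R can read with `read.delim`. The helper names `build_abundance_matrix` and `to_tsv`, the sample names and the counts are all hypothetical; whether a given package expects taxa as rows or as columns should be checked in its documentation.

```python
def build_abundance_matrix(per_sample_counts):
    """Merge {sample: {taxon: count}} into (taxa, samples, matrix).

    Rows are taxa, columns are samples; taxa missing from a sample get 0.
    """
    samples = sorted(per_sample_counts)
    taxa = sorted({t for counts in per_sample_counts.values() for t in counts})
    matrix = [[per_sample_counts[s].get(t, 0) for s in samples] for t in taxa]
    return taxa, samples, matrix


def to_tsv(taxa, samples, matrix):
    """Render the matrix as tab-separated text with a header row."""
    lines = ["taxon\t" + "\t".join(samples)]
    for taxon, row in zip(taxa, matrix):
        lines.append(taxon + "\t" + "\t".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Invented counts for two hypothetical samples.
per_sample = {
    "sampleA": {"Escherichia coli": 35, "Staphylococcus aureus": 25},
    "sampleB": {"Escherichia coli": 10},
}

taxa, samples, matrix = build_abundance_matrix(per_sample)
print(to_tsv(taxa, samples, matrix))
```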

I have experience using Metaxa2 in one of my projects. It extracts 16S rRNA sequences from shotgun sequencing data (using BLAST and MAFFT) and assigns a taxonomic classification based on the taxonomy database you provide. You can use SILVA, Greengenes or other 16S rRNA databases. I ended up using the SILVA database with Metaxa2 because I am trying to characterize bacterial diversity, but you should pick a database depending on what you are trying to classify (bacteria, viruses, fungi etc.). Metaxa2 also comes with a suite of analysis tools for building an abundance matrix that can be used with various R packages for further analysis. I used vegan to calculate diversity indices and metagenomeSeq for statistical analysis of the Metaxa2 output.

Metaxa2 can be a bit slow because it uses command-line BLAST to perform searches, but I would suggest running your samples through two or three different tools to figure out the right one for you. Good luck!

score 0 · Answer 2 · 2021-01-28


3.2 years ago

obriek11 • 0

Hello,

I think my course of action for the same situation is to convert the Kraken2 report file into a BIOM table using the kraken-biom package. After that I would import the BIOM table into phyloseq; the analysis from there is fairly standard.
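If you want to sanity-check such a table before loading it into phyloseq, a BIOM file in the JSON-based 1.0 format can be inspected with plain Python. This is only a sketch: it assumes the table was written as BIOM 1.0 JSON (not HDF5), the function name `biom_json_to_counts` is made up, and the tiny table below is hand-written with invented counts rather than real kraken-biom output.

```python
import json


def biom_json_to_counts(biom_text):
    """Convert BIOM 1.0 JSON text into {sample id: {row id: count}}."""
    table = json.loads(biom_text)
    row_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    counts = {s: {} for s in sample_ids}
    if table["matrix_type"] == "sparse":
        # Sparse data is a list of [row index, column index, value] triples.
        for r, c, value in table["data"]:
            counts[sample_ids[c]][row_ids[r]] = value
    else:
        # Dense data is a list of rows, one value per column.
        for r, row in enumerate(table["data"]):
            for c, value in enumerate(row):
                if value:
                    counts[sample_ids[c]][row_ids[r]] = value
    return counts


# Hand-written minimal table (only the fields this sketch reads).
example_biom = json.dumps({
    "rows": [{"id": "562", "metadata": None}, {"id": "1280", "metadata": None}],
    "columns": [{"id": "sampleA", "metadata": None}, {"id": "sampleB", "metadata": None}],
    "matrix_type": "sparse",
    "shape": [2, 2],
    "data": [[0, 0, 35], [0, 1, 10], [1, 0, 25]],
})

print(biom_json_to_counts(example_biom))
```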

ADD COMMENT • link 3.2 years ago by obriek11 • 0