Question

Shotgun rarefactions - metagenomics (microbiome), MetaPhlAn2

2


7.0 years ago

robert.kwapich ▴ 50

Hi there community!

For some time I have been working on 16S rRNA gene survey data. For this type of analysis one can rarefy so that every sample has the same depth. Comparing samples of different depths is sometimes likened to surveying 1 square meter of Amazon jungle and 1 square kilometer of the Mojave Desert and then comparing OTUs, taxa, etc. Rarefaction is relatively easy to apply, as it is implemented in many software packages, e.g. QIIME and mothur.
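To make the idea concrete, here is a minimal sketch of rarefaction on a count table: each sample is subsampled without replacement down to a common depth. The function name `rarefy`, the OTU labels and the toy counts are made up for illustration; this is not how QIIME or mothur implement it internally.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a {taxon: count} profile without replacement to `depth` observations."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of observations in the sample")
    # Expand to one label per observation, then draw `depth` of them without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    drawn = rng.sample(pool, depth)
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in drawn:
        rarefied[taxon] += 1
    return rarefied


# Toy samples with very different depths (hypothetical numbers).
jungle = {"OTU_a": 500, "OTU_b": 300, "OTU_c": 200}
desert = {"OTU_a": 40, "OTU_b": 10, "OTU_c": 0}

# Rarefy both samples to the depth of the shallowest one.
depth = min(sum(jungle.values()), sum(desert.values()))
rarefied_jungle = rarefy(jungle, depth)
rarefied_desert = rarefy(desert, depth)
```

After this step both profiles contain the same number of observations, so richness and similar measures are compared at equal sampling effort.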

I now have a shotgun dataset: whole-genome sequencing of a microbiome. To start, I am following the Microbiome Helper SOP. For taxonomic assignment I use MetaPhlAn2. The MetaPhlAn2 wiki doesn't even mention rarefaction. Since this step might be crucial for comparative analyses, where I have two groups/categories, each containing around 30 samples, I want each sample to be as "standardized" as possible. Are there any approaches to rarefy WGS data? Is there a reason why it has not yet been implemented in, for example, MetaPhlAn2?

I'd be grateful for any insight, comments and suggestions.

metagenomics whole genome sequencing shotgun • 5.7k views

updated 2.8 years ago by serene.s • 0 • written 7.0 years ago by robert.kwapich ▴ 50

1


Hi, did you find any solution to this problem? Any suggestions on how to compute diversity after running MetaPhlAn2?

4.6 years ago by biobiu ▴ 150

0


Thanks for this post robert.kwapich, this is a critical step if you want to compare groups of samples that have been shotgun-metagenome sequenced! My initial instinct was to rarefy based on single-copy bacterial housekeeping genes or the eukaryotic contamination, but I don't want to reinvent the wheel if there is already a method available! Cheers!

4.2 years ago by sapuizait ▴ 10

score 3 · Answer 1 · 2019-09-09

I followed some methods from the paper: "Unexplored diversity and strain-level structure of the skin microbiome associated with psoriasis", https://www.nature.com/articles/s41522-017-0022-5.pdf.

I also remember using the "Nonpareil" software to estimate the saturation/redundancy of my samples. All of them reached a nicely high percentage except one or two, which I discarded.

See Nonpareil: http://enve-omics.ce.gatech.edu/nonpareil/

What I did later was to convert relative abundances (i.e. percentages) to pseudo-counts, i.e. multiply percentages by the number of reads per sample.
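A minimal sketch of that conversion, assuming the profile is a mapping from taxon to relative abundance in percent and that `n_reads` is the read count you choose for the sample (total reads is an assumption here; the species labels and numbers are made up):

```python
def to_pseudo_counts(rel_abundance_pct, n_reads):
    """Turn percentages (0-100) into integer pseudo-counts for one sample."""
    # Divide by 100 so that percentages become fractions of the read count.
    return {
        taxon: round(pct / 100.0 * n_reads)
        for taxon, pct in rel_abundance_pct.items()
    }


# Hypothetical species-level profile for one sample.
profile = {"species_A": 60.0, "species_B": 30.0, "species_C": 10.0}
pseudo = to_pseudo_counts(profile, 2_000_000)
```

Because of rounding, the pseudo-counts of a real profile may not sum exactly to `n_reads`, which should not matter much at typical sequencing depths.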

This produces microbial profiles with different numbers of observations (i.e. counts), reflecting the different sequencing depths. For taxonomic abundance analysis you could then use the edgeR implementation of GLMs (see ). This method can account for different numbers of observations.

For alpha and beta diversity I normalized the counts/observations to the same total number of observations, e.g. the maximum total across samples. Since all my samples had comparable numbers of sequences and reached comparable saturation, this probably doesn't introduce much error.
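Roughly, the normalization looks like the sketch below: every sample is rescaled so that its counts sum to the same target, here the largest sample total. The helper name `scale_to_total` and the toy data are only for illustration, and note that this is proportional scaling, not random subsampling.

```python
def scale_to_total(counts, target_total):
    """Rescale a {taxon: count} profile so its counts sum to `target_total`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot scale an empty profile")
    factor = target_total / total
    return {taxon: n * factor for taxon, n in counts.items()}


# Hypothetical pseudo-count profiles for two samples.
samples = {
    "S1": {"species_A": 600, "species_B": 400},
    "S2": {"species_A": 1500, "species_B": 500},
}

# Use the largest sample total as the common target.
target = max(sum(counts.values()) for counts in samples.values())
scaled = {name: scale_to_total(counts, target) for name, counts in samples.items()}
```

The scaled values are kept as floats here; round them if a downstream tool insists on integers.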

Nevertheless, the Nature paper above uses the number of unique species in each sample as a measure of richness, and for this, if every sample has reached similarly high saturation, we would not expect much difference. But evenness, measured for example with the inverse Simpson index, needs these normalized pseudo-counts, stratified at some level, e.g. species.
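For reference, a minimal sketch of both measures on a species-level count profile, using the standard definitions (richness as the number of taxa with a non-zero count, inverse Simpson as 1 / sum(p_i^2)). The function names and the example profile are made up, and this is not necessarily the exact procedure used in the paper.

```python
def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2) over observed taxa."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no observations")
    return 1.0 / sum((n / total) ** 2 for n in counts.values() if n > 0)


# Hypothetical species-level profile: two equally abundant species, one absent.
example = {"species_A": 50, "species_B": 50, "species_C": 0}
example_richness = richness(example)
example_inv_simpson = inverse_simpson(example)
```

In this plain form the index depends only on the proportions, so raw and rescaled counts give the same value; variants that correct for sample size do depend on the actual counts.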

But it has been some time, and many papers have been published since then that I haven't followed. So, that is it. If you find something better, please let me know.