Question: Analyzing Microbial Community Pyrosequence Data
9.4 years ago by
Furor40 wrote:

Dear all,

this might be (is!) very much a n00b question, and I hope you forgive me on this one ;)

To set the context, I'm a biologist specialised in (terrestrial) ecology, with main interests in dispersal and population dynamics. However, I recently started at a microbiology lab to do research on community composition and turnover. I had some microbiology courses during my training, but that's about it: I have no experience whatsoever with sequencing or the analysis thereof.

I am the first in this lab to use high-throughput techniques (pyrosequencing), but there is already data available (the sequencing was outsourced). The hypervariable V1-V3 regions of the 16S SSU rRNA gene were sequenced. AmpliconNoise was used to clean up the raw data, after which the sequences were run against the RDP database.

So right now I'm processing this particular dataset. I learnt some Perl to perform some quality checks on the sequences (orientation, length, ...). I removed primers and tags and ran the sequences against the RDP to compare 'my' results with the ones already in the database, not only for the sake of double-checking but also to get acquainted with the material ...
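The quality checks described above (orientation, primer removal, length) can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl, assuming exact primer matches and reads held in a dictionary; the function names and the `min_len` default are hypothetical, and real primers with degenerate bases would need fuzzy matching.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def trim_and_filter(reads, primer, min_len=200):
    """Orient reads on the primer, strip it, and keep reads >= min_len.

    reads:  {read_name: sequence}
    primer: exact primer sequence (assumption: no degenerate bases)
    """
    kept = {}
    for name, seq in reads.items():
        seq = seq.upper()
        if not seq.startswith(primer):
            # Try the opposite orientation before giving up on the read.
            seq = reverse_complement(seq)
        if not seq.startswith(primer):
            continue  # primer not found in either orientation
        trimmed = seq[len(primer):]
        if len(trimmed) >= min_len:
            kept[name] = trimmed
    return kept
```

Running the same reads through several `min_len` values (200, 250, 300) shows directly how many reads each threshold discards.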

To get to the point: how do you analyse this kind and, not least, this amount of data? I want to make sure I'm doing everything the right way (presuming everything up to this point was processed correctly).

People here use phylogenetic trees (which I still need to master) to describe community structure, as is standard practice I assume, but this seems impractical for this kind of data. So I'm creating pivot tables to compare presence (and abundances) between samples. I compare the 'raw' (yet AmpliconNoised) data with selections based on sequence length (200, 250, 300 bp) and similarity (more than .95, .97 or .99), separately for the forward and reverse reads. Further, I total the forward and reverse reads (to look for the 'total diversity') and take the highest abundance as the 'correct' one (notice the quotation marks ...).
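For concreteness, the pivot-table step might look like the sketch below. It assumes each classified read has been reduced to a `(sample, direction, taxon)` tuple; that layout and the function names are illustrative, not part of any particular tool.

```python
from collections import Counter


def count_reads(assignments):
    """Count reads per (sample, direction, taxon) combination.

    assignments: iterable of (sample, direction, taxon) tuples, one per read.
    """
    return Counter(assignments)


def combine_max(counts):
    """For each (taxon, sample), keep the higher of the per-direction counts."""
    combined = {}
    for (sample, direction, taxon), n in counts.items():
        key = (taxon, sample)
        combined[key] = max(combined.get(key, 0), n)
    return combined


def presence(combined):
    """Reduce the abundance table to the set of taxa seen in each sample."""
    seen = {}
    for (taxon, sample), n in combined.items():
        if n > 0:
            seen.setdefault(sample, set()).add(taxon)
    return seen
```

Keeping the forward and reverse counts separate until the last step makes it easy to check how often the two directions disagree.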

Does this seem like a correct way to do this? And how do you present the results (plots, tables, ...)? At this stage, we really only want to look at diversity (what is present). At a later stage, we want to link the community composition to environmental parameters and find indicator species. One (personal) problem I have with all this is the lack of replicates, at least in the data I currently have to work with. For future analyses this will be dealt with.

Further, some other questions:

  1. Given the possibly dubious nature of pyrosequenced reads, do you think AmpliconNoise is (good) enough to ensure that the reads are biologically relevant? Where do you set the limits when cleaning up your data? I feel like using only reads of 300 bp or longer for robustness' sake, but perhaps that way you miss a lot of (relevant) information?

  2. Do you think it is possible to infer abundances from pyrosequence reads, taking into account the possibility of multiple operons? Amend et al. (2010) put forward the possibility of semi-quantitativeness, implying that abundances are only comparable within a species/OTU between samples.

  3. How do you process raw high-throughput data for community analysis?

  4. Other suggestions? (articles, books, experiences, ...)
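On question 2, one way to read the 'semi-quantitative' point is to convert read counts to per-sample proportions and then compare a single OTU across samples, rather than comparing different OTUs within one sample. The sketch below assumes counts are stored as `{sample: {otu: reads}}`; the names are hypothetical, and it makes no correction for operon copy number.

```python
def relative_abundance(counts):
    """Convert {sample: {otu: reads}} into {sample: {otu: proportion}}."""
    result = {}
    for sample, otus in counts.items():
        total = sum(otus.values())
        if total == 0:
            result[sample] = {}  # no reads, so no proportions to report
            continue
        result[sample] = {otu: n / total for otu, n in otus.items()}
    return result


def otu_across_samples(rel, otu):
    """Proportion of one OTU in every sample (0.0 where it is absent)."""
    return {sample: props.get(otu, 0.0) for sample, props in rel.items()}
```

Proportions remove differences in sequencing depth between samples, but they do not make abundances of different OTUs comparable with each other.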

Many thanks! Kind regards.

9.4 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

I found the mothur wiki on OTU based calculators very helpful:

There are also quite a few analysis examples. Some of these may be a little outdated, as the latest release changed the naming scheme:
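To give a feel for what OTU-based calculators compute, here is a small sketch of two common ones: the Shannon index and the bias-corrected form of the Chao1 richness estimator. It is written from the textbook formulas and is not mothur's code; the input is assumed to be a list of read counts, one per OTU, for a single sample.

```python
import math


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over OTUs with at least one read."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.
    """
    observed = sum(1 for n in counts if n > 0)
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

Because Chao1 depends on singletons and doubletons, it is sensitive to how aggressively the reads were denoised before clustering.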

9.4 years ago by
DG7.2k wrote:

Phylogenetic analyses are really the way to go for a reason, and that's because you need something that uses models of evolution to really quantify the diversity and community composition. Simple clustering based on sequence identity is only a crude approximation to robust phylogenetic analyses.

The practice is common, so it is hardly impractical. There are a variety of alignment and phylogenetic tree reconstruction programs that have been optimized for large amounts of data. MAFFT runs quite fast (parallelized) on large datasets. FSA is also reasonably fast for large numbers of sequences. FastTree and RAxML-Light both run quite fast on large datasets. Of course, there are also a host of other tools that have been (and are being) designed specifically for microbial community analysis. Mothur, as mentioned previously, is a good example.
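As a rough sketch of how an align-then-tree run could be scripted, the snippet below chains MAFFT and FastTree from Python. It assumes both programs are installed and on the PATH under the names `mafft` and `FastTree` (the executable name can differ between installations); the file names and helper functions are hypothetical.

```python
import subprocess


def mafft_command(fasta, threads=4):
    """MAFFT with automatic strategy selection, using several threads."""
    return ["mafft", "--auto", "--thread", str(threads), fasta]


def fasttree_command(alignment):
    """FastTree on a nucleotide alignment with the GTR model."""
    return ["FastTree", "-nt", "-gtr", alignment]


def run_to_file(command, out_path):
    """Run a command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


def build_tree(fasta, alignment="aligned.fasta", tree="tree.nwk", threads=4):
    """Align the sequences, then infer a tree from the alignment."""
    run_to_file(mafft_command(fasta, threads), alignment)
    run_to_file(fasttree_command(alignment), tree)
```

Both programs write their result to standard output, which is why the helper redirects it into a file.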

9.4 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

The phylogenetic analysis has been incorporated in gene model building by Kelley and Salzberg with Glimmer-MG. Folks like us who work at the intersection of nutrition and health want to know the metabolic capacity of the microbial community and so being able to build gene models from partial sequence data is important.

Glimmer-MG is specifically designed to assess the functional capacity of a metagenomic sample accurately. It uses data from a sample in which many sequences cover only a portion of a gene and originate from species that are rare, common or abundant in that sample. Specifically, phylogenetic classifications (as opposed to percent G+C content), sequence clustering, and modeling of both insertion/deletion and substitution sequencing errors together yield highly accurate gene predictions.

Edit added 30 Nov 2011: The Glimmer paper by Kelley and Salzberg is now available here.

9.0 years ago by
Kevin Purdy10
Kevin Purdy10 wrote:

I know this is a little late in the day, as your question was posed a while back, but we have dealt with this problem, and especially with the issues that multiple sequence alignment will pose (it will turn good clean data into noise!). We have just published a data pipeline that takes you to the point of having good OTU clusters defined. I would still say you need to align and phylogenetically analyse representatives of these clusters to really understand what you are detecting, but this is much more achievable with 10s or even 100s of representatives than with 10,000s of pyrosequences! The reference is Oakley et al., ISME J, DOI:10.1038/ismej.2011.165 (web link is:
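Picking one representative per OTU cluster could be done in several ways; the sketch below simply takes the most frequent unique sequence in each cluster. That rule is an assumption for illustration and is not taken from the published pipeline; the input layout and function name are hypothetical.

```python
from collections import Counter


def pick_representatives(clusters):
    """Choose the most frequent sequence in each cluster as its representative.

    clusters: {otu_id: [sequence, ...]} with one entry per read.
    Ties are broken alphabetically so the result is deterministic.
    """
    representatives = {}
    for otu_id, sequences in clusters.items():
        if not sequences:
            continue  # skip empty clusters
        counts = Counter(sequences)
        best = max(counts.values())
        representatives[otu_id] = min(s for s, n in counts.items() if n == best)
    return representatives
```

The resulting handful of sequences is what would then go into alignment and tree building.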

Good luck!
