Question: Microbial diversity analysis using whole-genome metagenomic data
elsoja wrote (4 weeks ago):

I have data, obtained from a single metagenomic DNA sample, that consists of two MiSeq FASTQ files (R1 and R2) that I merged using PEAR.

Now I want to estimate the abundances of the bacterial taxa to generate a figure like this one:

Figure from: Panosyan, Hovik, and Nils‐Kåre Birkeland. "Microbial diversity in an Armenian geothermal spring assessed by molecular and culture‐based methods." Journal of basic microbiology 54.11 (2014): 1240-1250.

The problem is that the 16S region was not amplified, because the goal of the sequencing was to discover new genes. I've already isolated the 16S reads from my sample using SortMeRNA, but it seems that software for OTU picking, taxonomic assignment and diversity analysis (such as mothur and QIIME) requires all the reads to come from the same region of the 16S gene.

Is there a way of using these 16S reads that I've filtered using SortMeRNA in a diversity analysis using mothur/QIIME?

Comment by elsoja (4 weeks ago): Cross-posted on StackExchange
Bioinformatics_NewComer wrote (4 weeks ago):

If you have access to a computing cluster, metagenomic classification tools like CLARK-S would be good to try. They give you abundances and allow you to perform other analyses.

Reply by elsoja (4 weeks ago): Thank you. From what I've read, tools like CLARK, Kraken and Kaiju are the answer to my problem.
Brian Bushnell wrote (4 weeks ago):

You could try assembling your 16S reads with an assembler that deals well with branches (possibly SPAdes), then aligning the resulting assemblies to other 16S sequences and trimming off the bases that extend past the ends of the alignment (and are thus not 16S bases). I'm not sure how well that would work; it depends on the data.

But I think your best bet would be to use the shotgun data as shotgun data, instead of trying to shoehorn it into 16S-based tools. We commonly assemble the whole metagenome, map the reads to the assembly to calculate coverage, and then align the contigs to existing databases like RefSeq to find out what they are. Once you know that contig_123 maps to E. coli and has 43x coverage, you can say you probably have 43x coverage of E. coli in your data. Whether this approach works depends on whether you have enough data to assemble; if only, say, 10% of your reads map to the assembly, then it's pretty much a failure and you'll need a different method.
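The coverage bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layout (one tuple of taxon, contig length and mean coverage per contig) and the example numbers are hypothetical, and the taxon labels would come from whatever alignment step you use against RefSeq.

```python
from collections import defaultdict


def taxon_coverage(contigs):
    """Length-weighted mean coverage per taxon.

    contigs: iterable of (taxon, length_bp, mean_coverage) tuples.
    This layout is an assumption made for the example.
    """
    mapped_bases = defaultdict(float)  # sum of length * coverage per taxon
    total_length = defaultdict(int)    # sum of contig lengths per taxon
    for taxon, length_bp, coverage in contigs:
        mapped_bases[taxon] += length_bp * coverage
        total_length[taxon] += length_bp
    return {t: mapped_bases[t] / total_length[t] for t in mapped_bases}


def relative_abundance(coverage_by_taxon):
    """Scale per-taxon coverage so the values sum to 1."""
    total = sum(coverage_by_taxon.values())
    return {t: c / total for t, c in coverage_by_taxon.items()}


# Made-up example values, not real data.
contigs = [
    ("Escherichia coli", 50_000, 43.0),
    ("Escherichia coli", 10_000, 43.0),
    ("Bacillus subtilis", 20_000, 7.0),
]
cov = taxon_coverage(contigs)
abund = relative_abundance(cov)
```

Weighting by contig length keeps a short, high-coverage contig (for example a repeat) from dominating a taxon's estimate. The resulting fractions only describe the part of the community that assembled and could be assigned, so they are worth reading alongside the fraction of reads that mapped to the assembly.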

One thing to try in that case is to compare reads directly to RefSeq to find what organisms they came from. You can get a list of organisms observed in your data with BBMap like this: in=data.fq refseq records=400

Once you know which organisms are present, you can download their genomes and map reads to them for quantification purposes. Mapping to all of RefSeq directly normally takes too much time or memory to be practical.
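For the quantification step, raw mapped-read counts usually need to be normalized by genome length, since a larger genome attracts more reads at the same cell abundance. A minimal sketch, assuming you already have a read count and a genome length per reference (the function name, genome names and numbers below are hypothetical):

```python
def abundance_from_read_counts(read_counts, genome_lengths):
    """Relative abundance from mapped-read counts, normalized by genome length.

    read_counts: dict mapping genome name to number of mapped reads.
    genome_lengths: dict mapping genome name to genome length in bp.
    Both inputs are assumptions made for the example.
    """
    reads_per_base = {
        g: read_counts[g] / genome_lengths[g] for g in read_counts
    }
    total = sum(reads_per_base.values())
    return {g: v / total for g, v in reads_per_base.items()}


# Made-up example: genome_A draws three times as many reads as genome_B,
# but it is also three times as long.
counts = {"genome_A": 9000, "genome_B": 3000}
lengths = {"genome_A": 6_000_000, "genome_B": 2_000_000}
result = abundance_from_read_counts(counts, lengths)
```

In this example both genomes come out at equal abundance even though the raw read counts differ threefold, which is the effect the normalization is meant to capture. It assumes reads of similar length and ignores reads that map ambiguously to more than one genome.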

You might also check out Kraken, which looks like it is designed for this purpose. I have not tried it, though.
