Question

Microbial diversity analysis using whole-genome metagenomic data

1


7.2 years ago

Antonio Camargo ▴ 160

I have data, obtained from a single metagenomic DNA sample, that consists of two MiSeq FASTQ files (R1 and R2) that I merged using PEAR.

Now I want to estimate the abundances of the bacterial taxa to generate a figure like this one:

Figure from: Panosyan, Hovik, and Nils‐Kåre Birkeland. "Microbial diversity in an Armenian geothermal spring assessed by molecular and culture‐based methods." Journal of basic microbiology 54.11 (2014): 1240-1250.

The problem is that the 16S region was not amplified before sequencing, because the goal of the sequencing was to discover new genes. I've already isolated the 16S reads from my sample using SortMeRNA, but it seems that software for OTU picking, taxonomic assignment and diversity analysis (such as mothur and QIIME) requires all reads to come from the same region of the 16S gene.

Is there a way to use the 16S reads that I filtered with SortMeRNA in a diversity analysis with mothur/QIIME?

metagenome taxonomy mothur qiime • 3.8k views

Updated 7.2 years ago by Brian Bushnell 20k • written 7.2 years ago by Antonio Camargo ▴ 160

0


Cross-posted on StackExchange

7.2 years ago by Antonio Camargo ▴ 160

score 1 · Answer 1 · 2017-08-22

1


7.2 years ago

Bioinformatics_NewComer ▴ 330

If you have access to a computing cluster, metagenomic classification tools like CLARK-S would be good to try. They give you abundances and allow you to perform other analyses.

ADD COMMENT • link 7.2 years ago by Bioinformatics_NewComer ▴ 330

0


Thank you. From what I've read, tools like CLARK, Kraken and Kaiju are the answer to my problem.

7.2 years ago by Antonio Camargo ▴ 160

score 1 · Answer 2 · 2017-08-22

You could try assembling your 16S reads with an assembler that deals well with branches (possibly SPAdes), then aligning the resulting contigs to other 16S sequences and trimming off the bases that extend past the ends of the alignment (and are thus not 16S bases). I'm not sure how well that would work; it depends on the data.

But I think your best bet would be to use the shotgun data as shotgun data, instead of trying to shoehorn it into 16S-based tools. We commonly assemble the whole metagenome, map the reads to the assembly to calculate coverage, and then align the contigs to existing databases like RefSeq to find out what they are. Once you know that contig_123 maps to E. coli and has coverage of 43x, you can say you probably have 43x coverage of E. coli in your data. Whether this approach works depends on whether you have enough data to assemble; if only, say, 10% of your reads map to the assembly, then it's pretty much a failure and you'll need a different method.
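The coverage bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the input layout (one tuple of taxon, contig length and mean coverage per contig) are hypothetical, and the numbers are made-up toy values, not output from any real tool.

```python
from collections import defaultdict


def mapped_fraction(mapped_reads, total_reads):
    """Fraction of reads that map back to the assembly (a sanity check)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def taxon_abundance(contigs):
    """Relative abundance per taxon from per-contig coverage.

    contigs: iterable of (taxon, length_bp, mean_coverage) tuples.
    Coverage is averaged per taxon, weighted by contig length, and the
    per-taxon coverages are then scaled so that they sum to 1.
    """
    covered_bases = defaultdict(float)  # taxon -> sum of length * coverage
    total_length = defaultdict(int)     # taxon -> sum of contig lengths
    for taxon, length_bp, coverage in contigs:
        covered_bases[taxon] += length_bp * coverage
        total_length[taxon] += length_bp
    mean_cov = {t: covered_bases[t] / total_length[t] for t in covered_bases}
    total = sum(mean_cov.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in mean_cov.items()}


# Toy input (hypothetical values): two contigs assigned to one taxon, one to another.
toy_contigs = [
    ("Escherichia coli", 5000, 43.0),
    ("Escherichia coli", 3000, 43.0),
    ("Bacillus subtilis", 4000, 10.0),
]
for taxon, fraction in taxon_abundance(toy_contigs).items():
    print(f"{taxon}\t{fraction:.3f}")
```

Because coverage is already normalized for sequence length, the resulting fractions approximate relative genome copy numbers rather than the share of reads per taxon.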

One thing to try in that case is to compare reads directly to RefSeq to find what organisms they came from. You can get a list of organisms observed in your data with BBMap like this:

sendsketch.sh in=data.fq refseq records=400

Once you know which organisms are present, you can download their genomes and map reads to them for quantification purposes. Mapping to all of RefSeq directly normally takes too much time or memory to be practical.
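For the quantification step, one common way to turn per-genome read counts into relative abundances is to divide each count by the genome length before scaling, so that large genomes are not over-represented. The sketch below assumes you already have a read count per reference genome from your mapper; the function name, the dictionary layout and the example numbers are all hypothetical.

```python
def relative_abundance(read_counts, genome_lengths):
    """Length-normalized relative abundance per genome.

    read_counts: dict of genome name -> number of mapped reads.
    genome_lengths: dict of genome name -> genome length in bp.
    Returns a dict of genome name -> fraction, with fractions summing to 1.
    """
    # Reads per base pair: corrects for longer genomes attracting more reads.
    density = {
        name: read_counts[name] / genome_lengths[name]
        for name in read_counts
    }
    total = sum(density.values())
    if total == 0:
        raise ValueError("no mapped reads to quantify")
    return {name: d / total for name, d in density.items()}


# Toy input (hypothetical values, not real measurements).
toy_counts = {"genome_A": 2000, "genome_B": 1000}
toy_lengths = {"genome_A": 4_000_000, "genome_B": 1_000_000}
for name, fraction in relative_abundance(toy_counts, toy_lengths).items():
    print(f"{name}\t{fraction:.3f}")
```

In this toy case genome_B ends up more abundant than genome_A despite having fewer reads, because its genome is four times shorter.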

You might also check out Kraken, which looks like it is designed for this purpose. I have not tried it, though.
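If you go the classifier route, the per-taxon table needed for an abundance figure can be pulled out of a Kraken-style report with a short script. The sketch below assumes the tab-separated report layout with six columns (percentage, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name); check your tool's documentation before relying on it. The function name and the toy report lines are hypothetical.

```python
def rank_fractions(report_lines, rank_code="P"):
    """Fractions of classified reads per taxon at one rank (default: phylum).

    report_lines: iterable of tab-separated Kraken-style report lines.
    Only rows whose rank code equals rank_code are kept, and the clade read
    counts of those rows are scaled so that they sum to 1.
    """
    counts = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, code, name = fields[1], fields[3], fields[5]
        if code == rank_code:
            counts[name.strip()] = int(clade_reads)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Toy report lines (hypothetical counts, for illustration only).
toy_report = [
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 60.00\t600\t0\tP\t1224\t    Proteobacteria",
    " 30.00\t300\t0\tP\t1239\t    Firmicutes",
]
for name, fraction in rank_fractions(toy_report).items():
    print(f"{name}\t{fraction:.3f}")
```

Note that these fractions are shares of reads, not of genomes, and they exclude unclassified reads and reads classified only above the chosen rank.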