Question

Looking for a tool for analyzing human microbiome data

2

7.8 years ago

benzhang ▴ 20

Hi all! I'm currently doing an analysis project on some human microbiota data from the NIH Human Microbiome Project. I want to search the microbiome data for the presence of certain bacteria and their relative abundance, using the reference bacterial genome provided on the NIH Human Microbiome Project website. To be a little more specific, I want to map the microbiome sequence reads to a specific bacterial reference genome with 97% sequence similarity (excluding 16S). Can anyone recommend a tool or software that can accomplish this? Thank you!!

genome alignment • 3.0k views

ADD COMMENT • link updated 7.8 years ago by Naren ▴ 990 • written 7.8 years ago by benzhang ▴ 20

score 4 · Answer 1 · 2016-07-13

See the links below:

This is the main site:

http://hmpdacc.org/

Its tools:

http://hmpdacc.org/resources/tools_protocols.php

Analyzing the Human Microbiome: A "How To" Guide for Physicians:

http://www.medscape.com/viewarticle/828715_3

It is based on this paper:

http://www.nature.com.sci-hub.cc/ajg/journal/v109/n7/full/ajg201473a.html

Better understand the role of microbes in human health and disease:

http://www.illumina.com/areas-of-interest/microbiology/human-microbiome-analysis.html

Disease cases:

https://commonfund.nih.gov/hmp/programhighlights

Old, but highly cited paper:

Human Microbiome Analysis

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002808

Some people have even looked at this:

Ancient human microbiomes

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4312737/

score 2 · Answer 2 · 2016-07-13

7.8 years ago

Brian Bushnell 20k

BBMap has a couple of relevant flags: idfilter and idtag. For example:

bbmap.sh in=reads.fq outm=mapped.sam ref=genomes.fa idfilter=0.97

...will only map the reads with at least 97% identity, and "outm" rather than "out" means only the mapped reads will be printed to the sam file. It will also add a field to the output indicating each read's identity to the reference.

Excluding 16S is a bit more difficult, though you can do that by masking ribosomal sequence prior to mapping. You can do that with BBMask. Alternatively, you could filter ribosomal reads from your data prior to mapping, using some database like Silva.

Typically, though, if you want to quantify the proportion of reads that came from various different references, I'd suggest using BBSplit or Seal, both of which are in the BBMap package. Seal is faster and easier to use; BBSplit is more accurate and uses less memory.

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

Thank you so much Brian!

ADD REPLY • link 7.8 years ago by benzhang ▴ 20

Hey, Brian! I've been using BBMap and it works great. However, I'm now looking at different metagenomic data and many different bacterial species at the same time. I was wondering whether I can create a custom database of all the bacterial genomes, so that I can map the metagenomic reads against all the species at once. Thanks!

Best, Ben

ADD REPLY • link 7.7 years ago by benzhang ▴ 20

Hi Ben,

You can simply concatenate all of the fasta references together, and map the reads to the combined file with BBMap. The sam output will indicate which reference each read mapped to. Does that answer your question?

-Brian
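
As a rough illustration of reading the reference assignment out of the SAM output, here is a minimal sketch that tallies reads per reference name. It relies only on the standard SAM columns (FLAG in column 2, RNAME in column 3). The toy records and the function name count_reads_per_ref are made up for this example and are not part of BBMap.

```shell
set -eu

# Toy SAM file (made-up records, for illustration only).
workdir=$(mktemp -d)
sam="$workdir/mapped.sam"
printf '@SQ\tSN:genomeA\tLN:1000\n'  > "$sam"
printf '@SQ\tSN:genomeB\tLN:2000\n' >> "$sam"
printf 'r1\t0\tgenomeA\t10\t40\t5M\t*\t0\t0\tACGTA\tIIIII\n'   >> "$sam"
printf 'r2\t16\tgenomeA\t50\t40\t5M\t*\t0\t0\tACGTA\tIIIII\n'  >> "$sam"
printf 'r3\t0\tgenomeB\t70\t40\t5M\t*\t0\t0\tACGTA\tIIIII\n'   >> "$sam"
printf 'r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n'            >> "$sam"
printf 'r5\t256\tgenomeA\t90\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n'  >> "$sam"

# Hypothetical helper: print "reference <TAB> reads <TAB> fraction of mapped reads".
count_reads_per_ref() {
  awk -F '\t' '
    /^@/ { next }                        # skip header lines
    { flag = $2 }
    int(flag / 4) % 2 == 1 { next }      # 0x4: read is unmapped
    int(flag / 256) % 2 == 1 { next }    # 0x100: secondary alignment
    int(flag / 2048) % 2 == 1 { next }   # 0x800: supplementary alignment
    { n[$3]++; total++ }                 # tally primary alignments by RNAME
    END { for (r in n) printf "%s\t%d\t%.4f\n", r, n[r], n[r] / total }
  ' "$1" | sort
}

count_reads_per_ref "$sam"
```

For paired-end data each mate is counted separately, and the fractions are relative to mapped reads only, so unmapped reads would need to be added to the denominator to get a share of the whole sample.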

ADD REPLY • link 7.7 years ago by Brian Bushnell 20k

Hi, Brian! Thanks for getting back to me. Do I just use the cat command-line tool? Thanks!

ADD REPLY • link 7.7 years ago by benzhang ▴ 20

Yep, cat will work fine.
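
A minimal sketch of that concatenation, with two sanity checks that are my own additions rather than anything BBMap requires: counting the sequences in the combined file, and looking for duplicated header lines, which would make the reference names in the SAM output ambiguous. The file names and toy sequences are made up for illustration.

```shell
set -eu

# Toy reference files (made-up sequences, for illustration only).
workdir=$(mktemp -d)
printf '>x_contig1\nACGTACGT\n' > "$workdir/x.fa"
printf '>y_contig1\nGGCCGGCC\n>y_contig2\nTTAATTAA\n' > "$workdir/y.fa"

# Concatenate all references into one multi-FASTA file.
# Each input should end with a newline, otherwise the last sequence line
# of one file gets fused with the first header of the next.
cat "$workdir/x.fa" "$workdir/y.fa" > "$workdir/combined.fa"

# One header line per sequence, so this is the number of sequences.
n_seqs=$(grep -c '^>' "$workdir/combined.fa")

# Header lines that occur more than once (empty if all names are unique).
dup_names=$(grep '^>' "$workdir/combined.fa" | sort | uniq -d)

echo "sequences in combined.fa: $n_seqs"
if [ -n "$dup_names" ]; then
  echo "warning: duplicated sequence names:" >&2
  echo "$dup_names" >&2
fi
```

The combined file can then be passed as the ref= argument in the same way as genomes.fa in the bbmap.sh command earlier in the thread.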

ADD REPLY • link 7.7 years ago by Brian Bushnell 20k

Hi, Brian. Sorry, one more question. Can you show me the command line I should use to quantify the relative amount of reads mapped to each reference with BBSplit? Thank you so much!

ADD REPLY • link 7.7 years ago by benzhang ▴ 20

More complete instructions are in the shell script (bbsplit.sh), if you open it with a text editor or run it with no arguments. But in general, when you have multiple reference fasta files (for example, x.fa and y.fa) you would do this:

bbsplit.sh ref=x.fa,y.fa in=reads.fq basename=out_%.fq outu=unmapped.fq refstats=refstats.txt

This will produce three files: out_x.fq, out_y.fq, and unmapped.fq, each containing the reads that mapped best to that reference (or for unmapped, did not map to any reference). The refstats.txt file will give counts of reads that mapped to each reference.
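
Independently of refstats.txt, the relative amounts can be cross-checked by counting the reads in each output file. The sketch below assumes uncompressed FASTQ with exactly four lines per record; the toy files stand in for out_x.fq, out_y.fq, and unmapped.fq, and the helper names make_fastq and count_reads are made up for this example.

```shell
set -eu

workdir=$(mktemp -d)

# Hypothetical helper: write N toy reads to FILE (usage: make_fastq FILE N).
make_fastq() {
  : > "$1"
  i=1
  while [ "$i" -le "$2" ]; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i" >> "$1"
    i=$((i + 1))
  done
}
make_fastq "$workdir/out_x.fq" 6
make_fastq "$workdir/out_y.fq" 3
make_fastq "$workdir/unmapped.fq" 1

# Hypothetical helper: number of reads, assuming four lines per FASTQ record.
count_reads() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Total reads across all output files.
total=0
for f in "$workdir/out_x.fq" "$workdir/out_y.fq" "$workdir/unmapped.fq"; do
  total=$((total + $(count_reads "$f")))
done

# One line per file: name, read count, percentage of all reads.
report=$(
  for f in "$workdir/out_x.fq" "$workdir/out_y.fq" "$workdir/unmapped.fq"; do
    awk -v name="$(basename "$f")" -v n="$(count_reads "$f")" -v t="$total" \
      'BEGIN { printf "%s\t%d\t%.1f%%\n", name, n, 100 * n / t }'
  done
)
echo "$report"
```

These percentages are shares of reads, not of organisms: longer genomes attract more reads, so normalizing by genome length may be needed before comparing abundances between references.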

ADD REPLY • link 7.7 years ago by Brian Bushnell 20k

score 1 · Answer 3 · 2016-07-15

7.8 years ago

Naren ▴ 990

Here is a recent paper on accurate read assignment to organisms:
http://nar.oxfordjournals.org/content/43/10/e69.full.pdf+html

They have developed GOTTCHA.
GOTTCHA is an application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly smaller false discovery rates (FDR).

ADD COMMENT • link 7.8 years ago by Naren ▴ 990

Thank you, Nari!

ADD REPLY • link 7.8 years ago by benzhang ▴ 20

score 0 · Answer 4 · 2016-07-14

7.8 years ago

benzhang ▴ 20

Thank you for the links, Natasha, and thank you for the tools, Brian! I really appreciate your help!

ADD COMMENT • link 7.8 years ago by benzhang ▴ 20

If you found what you need, then please accept an answer (or wait for more if you are not completely satisfied).

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k