Hi all! I'm currently doing an analysis project on some human microbiota data from the NIH Human Microbiome Project. I want to search the microbiome data for the presence of certain bacteria and their relative abundance, using the reference bacterial genomes provided on the NIH Human Microbiome Project website. To be a little more specific, I want to map the microbiome sequence reads to a specific bacterial reference genome at 97% sequence similarity (excluding 16S). Can anyone recommend a tool or software that can accomplish this? Thank you!!
See the links below:
This is the main site:
Analyzing the Human Microbiome: A "How To" Guide for Physicians:
It is based on this paper:
Better understand the role of microbes in human health and disease:
Old, but highly cited paper:
Human Microbiome Analysis
Some people have even looked here:
Ancient human microbiomes
BBMap has a couple of relevant flags: idfilter and idtag. For example:
bbmap.sh in=reads.fq outm=mapped.sam ref=genomes.fa idfilter=0.97 idtag=t
...will keep only the reads that map with at least 97% identity, and "outm" rather than "out" means only the mapped reads will be printed to the sam file. The idtag flag adds a field to the output indicating each read's identity to the reference.
Excluding 16S is a bit more difficult, but it can be done by masking ribosomal sequence in the reference prior to mapping, for example with BBMask. Alternatively, you could filter ribosomal reads out of your data prior to mapping, using a database such as SILVA.
Typically, though, if you want to quantify the proportion of reads that came from various different references, I'd suggest using BBSplit or Seal, both of which are in the BBMap package. Seal is faster and easier to use; BBSplit is more accurate and uses less memory.
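If you stay with the plain BBMap route, a rough per-reference tally can be pulled straight from the SAM file. The following is only a minimal sketch, not something from the BBMap package: count_per_ref is a made-up helper name, and it assumes mapped.sam is an ordinary tab-delimited SAM file in which column 3 holds the reference sequence name. It counts alignment records per sequence, so a genome split over several contigs will appear as several rows, and the counts are not normalized by genome length.

```shell
# Tally mapped alignment records per reference sequence in a SAM file.
# Output columns: reference name, record count, fraction of all mapped records.
# count_per_ref is a hypothetical helper name, not part of BBMap.
count_per_ref() {
  awk -F'\t' '
    /^@/ { next }            # skip SAM header lines
    $3 == "*" { next }       # skip unmapped records (no reference name)
    { n[$3]++; total++ }     # count one record for this reference
    END {
      for (r in n) printf "%s\t%d\t%.4f\n", r, n[r], n[r] / total
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Usage: count_per_ref mapped.sam
```

Secondary or supplementary alignments, if present in the file, would each be counted as a separate record, so treat the fractions as a first look rather than a final abundance estimate.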
Here is a recent paper on accurate read assignment to organisms.
The authors developed GOTTCHA.
GOTTCHA is an application of a novel, gene-independent, signature-based metagenomic taxonomic profiling method that, according to the authors, has significantly smaller false discovery rates (FDR).