Question

How Does The Metagenome Assembly Process Affect SNPs?

10.0 years ago

weslfield ▴ 90

I am doing some research on certain SNPs of importance found within metagenomes, for example in human gut data provided by the Human Microbiome Project (HMP). My question is: how does the process of contig assembly affect SNPs? Will some SNPs be masked by the assembly process, either when overlapping contigs are joined to make a longer sequence or when these sequences are assigned phylogeny based on a reference genome? HMP uses SOAPdenovo for assembly. I have tried to find the answer to this question, but it remains a little unclear. Does anyone have any knowledge of or experience with this? Also, if assembly is a problem with regard to SNPs, can someone suggest the best way to BLAST unassembled metagenomic data, and perhaps the best source of such data? Thanks in advance for any help.

metagenomics snp assembly ncbi • 3.0k views

ADD COMMENT • link updated 10.0 years ago by umer.zeeshan.ijaz ★ 1.8k • written 10.0 years ago by weslfield ▴ 90

Accurately assembling a single genome can be a challenge. Assembling metagenomes is even more challenging. What exactly happens when a metagenome gets dotted with SNPs? This would be a fun project to simulate, and it would be nice to post some in-silico results. I wish I had fewer responsibilities so I could do it myself. Alas, career advancement usually means that we can work less and less on problems that are fun.
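A toy version of such a simulation might look like the sketch below. It assumes the reads are already aligned and of equal length, and that the assembler reports a single majority base per position. That is a deliberate simplification: SOAPdenovo and similar assemblers work on de Bruijn graphs, not column-wise votes. The function name and the `min_minor_fraction` threshold are made up for illustration.

```python
from collections import Counter

def consensus_and_minor_alleles(aligned_reads, min_minor_fraction=0.1):
    """Collapse equal-length aligned reads into a majority consensus.

    Returns the consensus string and a list of
    (position, consensus_base, minor_base, minor_fraction) tuples for
    variants that the consensus no longer shows.
    """
    consensus = []
    masked = []
    for pos, column in enumerate(zip(*aligned_reads)):
        counts = Counter(column)
        major, _ = counts.most_common(1)[0]
        consensus.append(major)
        for base, n in counts.items():
            fraction = n / len(column)
            if base != major and fraction >= min_minor_fraction:
                masked.append((pos, major, base, fraction))
    return "".join(consensus), masked

# Hypothetical pileup: 7 reads from one strain, 3 from a strain with a G->T SNP.
reads = ["ACGT"] * 7 + ["ACTT"] * 3
print(consensus_and_minor_alleles(reads))
```

In this toy case the consensus carries only the majority allele, so the minor-strain variant can only be recovered by going back to the reads.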

ADD REPLY • link 10.0 years ago by Istvan Albert 100k

score 3 · Answer 1 · 2014-04-10

A few comments:

Contigs don't overlap; you generate contigs from short reads.
To connect contigs together, you sequence mate-pair libraries, i.e. read pairs separated by long (kb-scale) inserts, and use them to generate scaffolds. (This is useful for circumventing the problem of repeated regions.)
A good way to deal with metagenomic contigs is to bin them together and do the analysis on the clustered contigs (no need to go for scaffolding). We have recently submitted our software CONCOCT (the initial version on arXiv details the steps involved: http://arxiv.org/abs/1312.4038 ).
You can use single-copy genes (COGs) as a proxy for the completeness of a metagenomic assembly. In the paper above we identified 36 such genes that should each occur once per genome. If you observe all of them, it implies you are more or less recovering the whole genome. I have shared two scripts that you can use with your annotated contigs from PROKKA ( http://www.vicbioinformatics.com/software.prokka.shtml ). They act as add-ons of sorts (please also consult my annotation page: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/annotation.html ):
- http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/PROKKA_RPSBLAST.sh
- http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/PROKKA_CDD.py
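The single-copy-gene idea can be sketched roughly as follows. This is not what the scripts above do internally; it only assumes you already have, for one bin, a list of the COG identifiers hit on its contigs. The function name and the placeholder COG IDs are hypothetical, and the real set of 36 genes should be taken from the paper.

```python
from collections import Counter

def scg_summary(hits, expected_cogs):
    """Summarise single-copy gene recovery for one bin.

    hits: iterable of COG ids found on the bin's contigs (repeats allowed).
    expected_cogs: set of COG ids expected exactly once per genome.
    """
    counts = Counter(h for h in hits if h in expected_cogs)
    present = sorted(c for c in expected_cogs if counts[c] >= 1)
    duplicated = sorted(c for c in expected_cogs if counts[c] > 1)
    missing = sorted(set(expected_cogs) - set(present))
    return {
        "completeness": len(present) / len(expected_cogs),
        "present": present,
        "duplicated": duplicated,  # may hint at a bin mixing several genomes
        "missing": missing,
    }

# Placeholder identifiers, not the actual gene set from the paper.
expected = {"COG_A", "COG_B", "COG_C", "COG_D"}
print(scg_summary(["COG_A", "COG_A", "COG_B", "COG_X"], expected))
```

A bin with most genes present once looks like a near-complete genome; many duplicated genes suggest that more than one genome ended up in the same bin.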
Regarding your BLAST question, you can use MEGAN ( http://ab.inf.uni-tuebingen.de/software/megan/ ) on your BLASTed reads; MEGAN also supports phylogenetic analysis. If you have contigs, you can use my software TAXAassign ( http://www.github.com/umerijaz/TAXAassign ) to assign them to the NCBI taxonomy at different taxonomic levels (Phylum, Class, Order, Family, Genus, and Species). It is used in the CONCOCT paper to validate binning.
With our collaborators we are also developing a metagenomic assembly pipeline. It may be worth a look as well (we are still in the process of writing it up): https://github.com/inodb/metassemble
As for SNPs, I am not really sure about their impact on assembly quality. However, when assembling a single genome from high-coverage data, we ended up with a smaller N50 because of sequencing errors and had to resort to subsampling our data to improve the assembly quality ( http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html?#SUBSAMPLINGVELVET ). We didn't need to subsample when we used SPADES ( http://bioinf.spbau.ru/spades ), as it uses BayesHammer to correct errors. I'll come back to the SNPs question when I have a sound argument.
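For reference, N50 and a simple random subsample can be sketched as below. This is a minimal illustration, not the one-liner linked above: the function names are made up, reads are treated as plain items in a list, and for paired-end data you would need to sample pairs together.

```python
import random

def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (order preserved)."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [r for r in reads if rng.random() < fraction]

contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contig_lengths))  # 10 + 9 + 8 = 27, half of the 54 total, so N50 is 8
print(len(subsample(list(range(1000)), 0.25)))
```

Comparing N50 across assemblies of the full and subsampled read sets is one way to see whether excess coverage is hurting contiguity.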

Best Wishes, Umer