I am doing some research focusing on certain SNPs of importance found within metagenomes, for example within the human gut, using data provided by the Human Microbiome Project (HMP). My question is: how does the process of contig assembly affect SNPs? Will some SNPs be masked by the assembly process when overlapping contigs are joined to make a longer sequence, or when these sequences are assigned phylogeny based on a reference genome? HMP uses SOAPdenovo for assembly. I have tried to determine the answer to this question, but it still remains a little unclear. Does anyone have any knowledge or experience with this? Also, if assembly is a problem with regard to SNPs, can someone suggest the best way to BLAST unassembled metagenomic data, and perhaps the best source of such data? Thanks in advance for any help.
Question: How Does The Metagenome Assembly Process Affect SNPs?
6.8 years ago by
weslfield • 90
6.8 years ago by
umer.zeeshan.ijaz • 1.8k
A few comments:
- Contigs don't overlap; they are generated from short reads.
- To connect contigs together, you sequence mate pairs (read pairs from long-insert libraries, spanning several kb) to generate scaffolds. This is useful for getting around the problem of repeated regions.
- A good way to deal with metagenomic contigs is to bin them together and do the analysis on the clustered contigs (no need to go for scaffolding). We have recently submitted our binning software, CONCOCT, for publication (the initial version on arXiv details the steps involved: http://arxiv.org/abs/1312.4038 ).
- You can use single-copy genes (COGs) as a proxy for how complete a metagenomic assembly is. In the paper above we identified 36 such genes that should each occur once per genome. If you observe all of them, it implies you are more or less recovering the whole genome. I have shared two scripts that you can use with your annotated contigs from PROKKA ( http://www.vicbioinformatics.com/software.prokka.shtml ); they sort of act as add-ons. Please also consult my annotation page ( http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/annotation.html ).
- Regarding your BLAST question, you can use MEGAN ( http://ab.inf.uni-tuebingen.de/software/megan/ ) on your BLASTed reads; phylogenetic analysis is also supported in MEGAN. If you have contigs, you can use my software TAXAassign ( http://www.github.com/umerijaz/TAXAassign ) to assign them to the NCBI taxonomy at different taxonomic levels (Phylum, Class, Order, Family, Genus, and Species). TAXAassign was used in the CONCOCT paper to validate the binning.
- With our collaborators we are also developing a metagenomic assembly pipeline. It may be worth a look as well (we are still in the process of writing it up): https://github.com/inodb/metassemble
- As for SNPs, I am not really sure about their impact on assembly quality. However, when assembling a single genome from high-coverage data, we ended up with a smaller N50 score because of sequencing errors, and had to resort to subsampling our data to improve the assembly quality ( http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html?#SUBSAMPLINGVELVET ). We didn't need to subsample when we used SPAdes ( http://bioinf.spbau.ru/spades ), as it uses BayesHammer to correct errors. I'll come back to the SNP question when I have a sound argument.
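
For readers unfamiliar with the N50 score mentioned above: it is the contig length such that contigs of that length or longer contain at least half of the total assembled bases. Below is a minimal Python sketch, assuming you already have a plain list of contig lengths. The function name `n50` is only illustrative and is not taken from Velvet, SPAdes, or any other tool named here.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Illustrative helper (hypothetical name): walks the contigs from
    longest to shortest and returns the length at which the running
    total first reaches half of the total assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")

    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths, made up for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Comparing this value before and after subsampling (or error correction) is one simple way to see whether the assembly became less fragmented, although N50 alone says nothing about correctness.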
Best Wishes, Umer
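
To illustrate the single-copy gene check described in the comments above, here is a minimal Python sketch. It assumes you have already extracted the gene identifiers annotated on one bin of contigs. The marker IDs (`COG_A`, `COG_B`, ...) and the function name `single_copy_summary` are hypothetical placeholders; they are not the actual 36 COGs from the CONCOCT paper, nor the shared scripts mentioned above.

```python
from collections import Counter


def single_copy_summary(observed_genes, marker_genes):
    """Summarise how many expected single-copy markers were observed.

    observed_genes: iterable of gene IDs annotated on a bin of contigs.
    marker_genes:   iterable of marker IDs expected once per genome.
    Both inputs are assumptions of this sketch, not a tool's format.
    """
    markers = set(marker_genes)
    if not markers:
        raise ValueError("marker_genes must not be empty")

    # Count only the observed genes that belong to the marker set.
    counts = Counter(g for g in observed_genes if g in markers)

    return {
        # Fraction of markers seen at least once.
        "completeness": len(counts) / len(markers),
        "missing": sorted(markers - set(counts)),
        # Markers seen more than once.
        "duplicated": sorted(g for g, n in counts.items() if n > 1),
    }


if __name__ == "__main__":
    # Hypothetical marker IDs, for illustration only.
    markers = ["COG_A", "COG_B", "COG_C", "COG_D"]
    observed = ["COG_A", "COG_B", "COG_B", "unrelated_gene"]
    print(single_copy_summary(observed, markers))
```

A completeness close to 1.0 would suggest the bin recovers more or less a whole genome. Markers that show up more than once may hint that contigs from more than one genome ended up in the same bin.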