Tools For De Novo Assembly For Exome Sequencing Data
3
4
9.9 years ago
michealsmith ▴ 780

I'm using tools such as Pindel to call structural variants from exome data. Since an exome covers only sparse regions with limited information, I'm looking only for large indels (say 200 bp): small enough that their breakpoints may fall within the exome, and big enough to be missed by SNP-centric callers like GATK. Because BWA does not handle multi-mapping reads well in low-complexity regions, I'm doing de novo assembly around all called breakpoints with ABySS, in order to exclude possible false positives.
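For readers who want to try the "assemble around breakpoints" idea, here is a minimal sketch of the bookkeeping step only: turning called breakpoints into padded, merged windows from which reads could then be extracted for local assembly. The function name `breakpoint_windows`, the `(chrom, pos)` input layout and the 500 bp flank are all assumptions for illustration; they are not part of Pindel or ABySS.

```python
def breakpoint_windows(breakpoints, flank=500):
    """Pad each (chrom, pos) breakpoint by `flank` bp and merge overlapping windows.

    `flank` is an arbitrary illustrative value, not a recommendation.
    """
    # Build padded windows, clamping the start at 0, and sort by chrom then start.
    padded = sorted((chrom, max(0, pos - flank), pos + flank)
                    for chrom, pos in breakpoints)
    merged = []
    for chrom, start, end in padded:
        # Merge with the previous window when on the same chromosome and overlapping.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical breakpoints, e.g. parsed from a caller's output.
example = [("chr1", 10_000), ("chr1", 10_200), ("chr2", 300)]
windows = breakpoint_windows(example)
for chrom, start, end in windows:
    print(f"{chrom}\t{start}\t{end}")
```

Merging first avoids assembling the same reads twice when two calls sit close together.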

This propels me to think: why not simply assemble the whole exome?

My questions:

1. I know of some assemblers such as ABySS and Velvet. But is there any algorithm that specifically calls variants based on de novo assembly?
2. How much RAM do I need to assemble the whole exome?
3. Are there any tricks that make assembling exome sequences different from whole-genome assembly? That is, is it reasonable to feed exome reads into algorithms designed for whole-genome assembly?

many thanks

3
9.9 years ago
matted 7.6k

I haven't used it myself, but there is an assembler tailored to finding variation from read data (so this answers your question #1). They don't specifically talk about exomes, but you can read it and decide if their assumptions are flexible enough for the changed data.

"De novo assembly and genotyping of variants using colored de Bruijn graphs" http://www.nature.com/ng/journal/v44/n2/full/ng.1028.html

0

Many thanks, this is Cortex! I noticed this paper a while ago but haven't had time to read it.

1

Hi there. Gerry has also just posted this on the Cortex Google group (search for cortex_var and google group), where I've posted an extensive reply. Essentially the answer is yes, sure. You should only need around 3Mb of RAM to do an exome, and Cortex has tools to build an assembly graph, call variants and dump a VCF. Cortex has mostly been used on whole-genome sequencing data, so the new automated pipeline is not really tailored for exome data. That pipeline lets you call at many k-mer sizes and take the union of the calls made at different k. Note that this is unlike whole-genome consensus assembly, where people try to choose the "optimal" k: for variant calling there fundamentally is no optimal k, so the best you can do is either vary the k-mer size (Cortex doesn't do this) or run at many k and take the union of all your calls. Specifically, the automatic error-cleaning option won't work well on exome data and you will need to do that step yourself. I give some details of how to do this on the Cortex Google group, and I'll probably post about it in more detail in the future.

cheers

Zam
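As an illustration of "run at many k and take the union", here is a minimal sketch that merges call sets made at different k and remembers which k values supported each call. It does not use Cortex itself; the `(chrom, pos, ref, alt)` key, the function name `union_calls` and the example calls are assumptions made up for this sketch.

```python
from typing import Dict, List, Set, Tuple

# Assumed identity of a call: (chrom, pos, ref, alt).
Call = Tuple[str, int, str, str]


def union_calls(calls_by_k: Dict[int, Set[Call]]) -> Dict[Call, List[int]]:
    """Union the call sets from all k; map each call to the k values that found it."""
    merged: Dict[Call, List[int]] = {}
    for k in sorted(calls_by_k):
        for call in calls_by_k[k]:
            merged.setdefault(call, []).append(k)
    return merged


# Hypothetical calls from two runs at k=31 and k=61.
calls_by_k = {
    31: {("chr1", 1000, "A", "ACGT"), ("chr2", 500, "G", "T")},
    61: {("chr1", 1000, "A", "ACGT"), ("chr3", 42, "CTTT", "C")},
}
merged = union_calls(calls_by_k)
for call, ks in sorted(merged.items()):
    print(call, "found at k =", ks)
```

Keeping the list of supporting k values is optional, but it makes it easy to see which calls were only recovered at a single k.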

0

Hi - one more comment as a result of further questions from Gerry. Cortex uses a couple of statistical models: one for genotyping, and one for classifying putative sites as polymorphic, repeat or error. The former uses a Poisson model for read distribution and was designed for whole-genome data, so I would not expect it to be well suited to exome data (though I haven't measured its genotyping accuracy on exome data). However, that doesn't stop you discovering variants; it's just the genotyping step that uses that model. If you just do discovery, you get a VCF where the sample columns/fields just have coverage on each allele. You could try to do your own genotyping on the basis of this, but you do need to be a little careful: since Cortex can call very long alleles, they can sometimes share homology, so you can get coverage on both alleles that is due to the shared sequence. That is, things might look heterozygous because both alleles have coverage, when actually you need to look at the points where the alleles differ. Cortex does this in its genotyping step. If you want to do this with exome data and Cortex, I can help. Further details are on the Cortex Google group.

Zam
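The point about shared homology can be sketched in a few lines: count read support only on k-mers that occur in one allele and not the other. This is a toy illustration, not Cortex's genotyping model; the function names, the toy sequences and k=4 are assumptions, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """All k-mers of seq, in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distinguishing_coverage(allele_a, allele_b, reads, k):
    """Sum read k-mer counts over k-mers unique to each allele."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    a, b = set(kmers(allele_a, k)), set(kmers(allele_b, k))
    cov_a = sum(counts[km] for km in a - b)  # k-mers only in allele A
    cov_b = sum(counts[km] for km in b - a)  # k-mers only in allele B
    return cov_a, cov_b


def naive_coverage(allele, reads, k):
    """Sum read k-mer counts over all k-mers of one allele, shared ones included."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return sum(counts[km] for km in set(kmers(allele, k)))


# Toy alleles differing at one base; every read comes from allele A only.
allele_a = "ACGTTGCATGCA"
allele_b = "ACGTTGGATGCA"
reads = [allele_a] * 3

cov_a, cov_b = distinguishing_coverage(allele_a, allele_b, reads, k=4)
naive_b = naive_coverage(allele_b, reads, k=4)
print("distinguishing:", cov_a, cov_b, "naive on B:", naive_b)
```

In this toy case the naive count on allele B is non-zero purely because of sequence shared with allele A, while the count restricted to distinguishing k-mers is zero.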

2
9.9 years ago
JC 13k
1. I'm not aware of any program that can do that. ABySS produces a file "*.bubbles.fa" that represents regions with small variations like SNPs, but that is not variant calling.
2. The memory needed depends on the sequencing depth and the type of sequencing; a really large FASTQ can exhaust your memory.
0

Cortex won't blow your memory, even for deep exome data. I've just been looking at 200x depth data and using 3Mb of RAM.

1
9.9 years ago
erwan.scaon ▴ 900

KisSplice is a tool that detects SNPs, AS, indels and ITRs de novo. It constructs a de Bruijn graph (similar to the DBG built by "classical" de novo assemblers), but instead of trying to construct contigs, it focuses on bubbles in the graph to identify polymorphisms.

Detailed information at: http://alcovna.genouest.org/kissplice/
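To make the "bubbles in the graph" idea concrete, here is a toy sketch that builds a de Bruijn graph from reads and reports simple two-branch bubbles. It is not KisSplice's algorithm; the function names, the `max_len` limit and the toy reads are assumptions, and it ignores reverse complements, sequencing errors and nested branching.

```python
from collections import defaultdict


def build_graph(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, node, max_len):
    """Follow the unbranched path starting at node, up to max_len nodes."""
    path = [node]
    while len(path) < max_len and len(graph.get(path[-1], ())) == 1:
        path.append(next(iter(graph[path[-1]])))
    return path


def find_bubbles(graph, max_len=50):
    """Return (source, branch1, branch2, sink) for nodes whose two branches reconverge."""
    bubbles = []
    for source, succs in list(graph.items()):
        if len(succs) != 2:
            continue
        p1, p2 = (walk(graph, s, max_len) for s in sorted(succs))
        common = set(p1) & set(p2)
        if not common:
            continue
        # The first node of branch 1 that also lies on branch 2 is the merge point.
        sink = next(n for n in p1 if n in common)
        bubbles.append((source, p1[:p1.index(sink)], p2[:p2.index(sink)], sink))
    return bubbles


# Two toy reads that differ by a single base (A vs T at position 7).
reads = ["AACCGTATCGGA", "AACCGTTTCGGA"]
bubbles = find_bubbles(build_graph(reads, k=4))
for source, branch1, branch2, sink in bubbles:
    print(source, branch1, branch2, sink)
```

With k=4 a single-base difference yields two branches of three nodes each between the branching node and the merge node, which is the signature a bubble-based caller looks for.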