Question

Verifying/Investigating Frameshifts On A De Novo Assembly


10.5 years ago

Tancata ▴ 210

We're doing some sequencing and de novo assembly to compare the genomes of a globally distributed eukaryotic parasite in different populations. We've come up with a potentially interesting pattern - one population seems to have a larger genome size (8Mb vs 6Mb), more duplicate genes, and more inferred frameshifts in the assembly than the others ("normal" population has about 10 per genome, these genomes have about 300-400).

This could have an interesting biological explanation, but I'm also worried that we're just doing the assembly wrong for this population. (We are using Quake to error-correct reads, followed by SPAdes for assembly; we tried various alternatives and found this combination to work best.)

I'm trying to work out if the frameshifts are real or an assembly artifact. The relevant data seem to be:

1) We have several isolates from the population, and all of them show the frameshifts at the same positions (they are very similar genomes overall).

2) When I map the reads back to the de novo assembly and look at the frameshift positions, there seems to be good support for them: coverage is good (30-40x, which is similar to the overall coverage), and many reads span the whole insertion or deletion that causes the frameshift.

For example:

[image]

And the reads mapping to that region (showing pretty even coverage across the deletion):

[image]

In one case, an insertion that causes a frameshift is supported by 22 reads, while 4 reads have a gap at that position; apart from this case, the frameshifts look real.
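One further check on the mapped reads is to tally, at each frameshift position, how many spanning reads carry the indel and how many do not. Below is a minimal Python sketch of that tally. It assumes each read has already been reduced to its 1-based alignment start and CIGAR string (the POS and CIGAR columns of a SAM record). The function names, the `window` tolerance and the toy reads are hypothetical and only for illustration.

```python
import re

# CIGAR operations as defined in the SAM format
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def classify_read(start, cigar, pos, window=2):
    """Return 'indel' if the read has an I or D operation within `window`
    bases of reference position `pos`, 'ref' if it spans `pos` without one,
    and None if the read does not cover `pos` at all."""
    ref = start            # reference coordinate of the next aligned base
    has_indel = False
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "ID" and abs(ref - pos) <= window:
            has_indel = True
        if op in "MDN=X":  # only these operations consume the reference
            ref += length
    if not (start <= pos < ref):
        return None
    return "indel" if has_indel else "ref"

def count_support(reads, pos, window=2):
    """Tally spanning reads with and without an indel near `pos`."""
    counts = {"indel": 0, "ref": 0}
    for start, cigar in reads:
        label = classify_read(start, cigar, pos, window)
        if label is not None:
            counts[label] += 1
    return counts

# Toy data (hypothetical): 22 reads with a 3 bp insertion, 4 without it,
# and one read that does not overlap the site.
toy_reads = [(100, "50M3I47M")] * 22 + [(100, "100M")] * 4 + [(300, "100M")]
counts = count_support(toy_reads, pos=150)
fraction = counts["indel"] / sum(counts.values())
print(counts, round(fraction, 3))
```

The `window` tolerance is there because aligners can place the same indel at slightly different coordinates within a repeat or homopolymer, which is also where sequencing-derived indel errors are most likely.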

Does this seem convincing, or are there further tests I could do with the reads/assembly to investigate further? Are there possible sequencing or assembly artifacts that would explain this?

I guess the best test would be to PCR over these regions from the original DNA, but I'd like to do the most thorough bioinformatic analysis possible before asking a wet lab person to do that.

Thanks a lot!


updated 10.5 years ago by Adrian Pelin ★ 2.6k • written 10.5 years ago by Tancata ▴ 210

score 1 · Answer 1 · 2013-11-02

Hello,

Interesting project, very similar to what I do as well.

A few questions come to mind:

1) How can you be sure the difference in genome size is not due to contaminants present in some samples but not others? Did you culture these isolates before sequencing, or did you extract them from the host and sequence them directly?

2) Which NGS platform did you use? I assume Illumina, from how clean your reads are and from your use of SPAdes, but what was your library design? MiSeq or HiSeq, paired-end or single-end?

3) I see exactly the same kind of indels in my project across different populations. I assembled with MIRA, Velvet and SPAdes, and the indels look real to me. What I did was take two contigs, one where the deletion is present and one where it is absent, and map reads to them stringently (as in, reads have to have 100% identity to the reference). Then look at the coverage: you expect to see a drop at that position in both cases.
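The coverage comparison can be made less subjective by expressing depth at the indel site as a fraction of the depth in the flanking sequence. Here is a rough Python sketch, assuming the stringent mapping has already been done and per-base depth exported as three tab-separated columns (contig, position, depth), which is the layout `samtools depth` writes. The function names, the `flank` and `skip` values and the contig name are made up for illustration.

```python
from statistics import median

def parse_depth(lines):
    """Read 'contig<TAB>pos<TAB>depth' lines into a dict keyed by (contig, pos)."""
    depths = {}
    for line in lines:
        if not line.strip():
            continue
        contig, pos, depth = line.rstrip("\n").split("\t")
        depths[(contig, int(pos))] = int(depth)
    return depths

def depth_ratio(depths, contig, pos, flank=50, skip=5):
    """Depth at `pos` divided by the median depth of the flanks.

    Positions within `skip` bases of `pos` are left out of the flanks so a
    local dip does not lower its own baseline. Positions missing from the
    table are treated as depth 0."""
    flank_vals = [
        depths.get((contig, p), 0)
        for p in range(pos - flank, pos + flank + 1)
        if abs(p - pos) > skip
    ]
    baseline = median(flank_vals)
    if baseline == 0:
        return None
    return depths.get((contig, pos), 0) / baseline

# Toy table (hypothetical): 40x everywhere, dropping to 20x around position 100
toy_lines = [
    f"contig_1\t{p}\t{20 if 95 <= p <= 105 else 40}" for p in range(1, 201)
]
toy_depths = parse_depth(toy_lines)
ratio_at_indel = depth_ratio(toy_depths, "contig_1", 100)
ratio_elsewhere = depth_ratio(toy_depths, "contig_1", 150)
print(ratio_at_indel, ratio_elsewhere)
```

With 100% identity required, reads from the other allele cannot align across the site, so a ratio well below 1 on both contigs would fit the two-allele picture, while a ratio near 1 on one contig and near 0 on the other would point to a single version being present.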

Keep in mind that you are working with eukaryotes, which are unlikely to be haploid, so it isn't out of the question that one allele carries the indel and the other doesn't.
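If one allele carried the indel and the other did not, roughly half of the spanning reads would be expected to show it. A quick way to compare observed counts against that expectation is an exact binomial test. The sketch below uses only the Python standard library and plugs in the 22 versus 4 read counts from the question. The function name is made up, and the 50/50 expectation assumes a diploid site with no mapping bias, which indels often violate.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than observing k successes out of n."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # small relative tolerance so ties are not lost to floating-point rounding
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Counts from the question: 22 reads with the insertion, 4 without
with_indel, without_indel = 22, 4
p_value = binom_two_sided_p(with_indel, with_indel + without_indel)
print(f"p = {p_value:.2e}")
```

A small p-value here only says the counts are a poor fit to a simple 50/50 heterozygote. It does not distinguish between a homozygous indel with a few erroneous or misaligned reads and a more complex situation such as unequal copy number.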