Question

Help improving Ion Torrent de novo assembly

7

9.3 years ago

pedroivo000 ▴ 110

Hey guys,

I need some help improving a de novo assembly of a prokaryote genome made using Ion Torrent reads (mean length = 230 bp, mean quality Q = 29). The assembly was originally made with MIRA 4, resulting in 519 contigs (~7.8 Mb). The DNA sample was contaminated with a symbiont prokaryote, so the 7.8 Mb corresponded to about 3 Mb of my genome of interest, with the rest belonging to the contaminant genome. I was able to filter out the contigs from the contaminant genome by running BLAST on all contigs against a closely related species with a fully sequenced genome. In the end, I got a "final" assembly of 165 contigs with a total length of 2.85 Mb (N50 27895 bp).
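For readers who want to recompute the summary figures quoted above (contig count, total length, N50), here is a minimal Python sketch. The function name and the toy lengths are made up for illustration. N50 is taken here as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly.

```python
def assembly_stats(contig_lengths):
    """Return (number of contigs, total length, N50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the first contig length at which the running sum
        # reaches at least half of the total assembly length.
        if 2 * running >= total:
            n50 = length
            break
    return len(lengths), total, n50


# Toy example with made-up lengths (not the real assembly).
print(assembly_stats([80, 70, 50, 40, 30, 20]))
```

In practice the lengths would come from the contig FASTA file; the parsing step is left out to keep the sketch short.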

I would like to improve the assembly by increasing the N50 value or extending the contigs in some way, but I am stuck now. This is my (not so much) progress so far:

I mapped the Ion Torrent reads to the contaminant genome using bwa mem (all default parameters) and used samtools to keep only the unmapped reads (the reads that should belong to the genome I want to assemble).
I used SPAdes to reassemble the unmapped reads, passing the 165 contigs as trusted contigs for gap closure, repeat resolution and graph construction (--trusted-contigs option).
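The "keep only the unmapped reads" step above comes down to checking one bit of the SAM FLAG field: bit 0x4 marks a read as unmapped, which is the selection `samtools view -f 4` makes. A minimal pure-Python sketch of that filter follows. The function name and the toy SAM records are hypothetical, and real data would normally be handled with samtools itself.

```python
def unmapped_reads_as_fastq(sam_lines):
    """Yield FASTQ records for reads whose SAM FLAG has the 'unmapped' bit (0x4) set."""
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 1 = QNAME, 2 = FLAG, 10 = SEQ, 11 = QUAL
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x4:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


# Toy input: one read mapped to the contaminant, one unmapped read.
toy_sam = [
    "@SQ\tSN:contaminant\tLN:1000",
    "read1\t0\tcontaminant\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII",
]
for record in unmapped_reads_as_fastq(toy_sam):
    print(record, end="")
```

Only `read2` survives the filter, since its FLAG value of 4 has the unmapped bit set.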

However, SPAdes keeps crashing when trying to assemble the unmapped reads. I believe the read coverage is now too low to do anything. Running SPAdes on all the reads from the sequencing run generates 8447 contigs, which is not really an improvement over the first MIRA assembly.
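One quick way to check the "coverage is too low" suspicion is a back-of-the-envelope estimate: number of reads times mean read length, divided by genome size. A small Python sketch is below. The read count is a placeholder and not a figure from this thread, while the 230 bp mean length and 2.85 Mb size are the values quoted above.

```python
def estimated_coverage(n_reads, mean_read_length, genome_size):
    """Rough average depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_length / genome_size


# Placeholder read count (assumption); substitute the number of unmapped reads.
n_unmapped_reads = 500_000
depth = estimated_coverage(n_unmapped_reads, 230, 2_850_000)
print(f"approximate coverage: {depth:.1f}x")
```

This ignores uneven coverage and reads lost to quality filtering, so it only gives a rough indication of whether depth is the likely problem.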

So, after this long explanation of my problems, here comes the question:

What do you guys usually do to improve a primary assembly in a situation like this? I am looking for a tool that can use the reads to extend the contigs or scaffold them, or any other strategy that could work in this case. I am trying the SPAdes error correction tool to improve the quality of the reads so I can remap them to the contaminant genome, hoping to get more unmapped reads for redoing the SPAdes assembly step. Would error correction be a good strategy in this case? Also, is BWA a good mapping tool for Ion Torrent reads?

Sorry if the questions seem dumb, I am a newbie in the genome assembly world. At least now I don't have to pray for the PCR gods to make something work.

Assembly ion-torrent next-gen genome • 5.3k views

updated 2.1 years ago by Ram 43k • written 9.3 years ago by pedroivo000 ▴ 110

Ram · Accepted Answer · 2015-02-10

3

9.3 years ago

crashfrog ▴ 40

I'm not sure much improvement is possible; you have roughly the number of contigs and N50 that I would expect from single-end IonTorrent data on a 2.8 Mbase genome. Our IonTorrent assemblies range anywhere from 50 to 15000 contigs, depending on the coverage and quality of the data. No short-read sequencing technology (reads shorter than 5000 bp or so) is going to result in complete assemblies, because it's theoretically impossible to resolve tandem repeats in the genome if no single read is long enough to span the repeat.
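The tandem-repeat point can be illustrated with a toy example. This is a sketch with made-up sequences, ignoring sequencing errors and read depth. Two "genomes" that differ only in the number of copies of a repeat unit produce exactly the same set of reads when the reads are shorter than the repeat array, so nothing in the set of read sequences says which genome they came from. Once reads are long enough to span the whole array plus its flanks, the two become distinguishable.

```python
def all_reads(genome, read_length):
    """Return the set of every substring of the given length (error-free 'reads')."""
    return {genome[i:i + read_length] for i in range(len(genome) - read_length + 1)}


# Made-up flanking sequences and repeat unit, for illustration only.
left, unit, right = "ACGTTGCA", "GAT", "CCTAGGTA"
three_copies = left + unit * 3 + right
four_copies = left + unit * 4 + right

# Reads shorter than the repeat array: both genomes yield the same set of reads.
short_same = all_reads(three_copies, 6) == all_reads(four_copies, 6)
# Reads long enough to span the whole array plus flanks tell the two apart.
long_same = all_reads(three_copies, 16) == all_reads(four_copies, 16)
print(short_same, long_same)
```

With 6 bp reads the two read sets are identical, while with 16 bp reads they differ.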

I guess I can't tell you if your assembly is "good enough" without knowing what you're using it for downstream, but in my group we would consider that assembly good enough for variant calling, phylogenetic clustering, even annotation. We probably wouldn't use it as a reference, but that's what PacBio is for.

What you really want to look out for is overassembly or structural misassembly, which is an increasing risk as you push towards a higher N50; if a good reference genome for your organism exists, it's worth running QUAST (http://bioinf.spbau.ru/quast) to see if your contigs are right and cover the entire genome. Without aligning to a reference, there's no other information in your single-end reads that would allow you to scaffold your contigs.

1

Thanks for the answer! My final objective is to publish a draft genome for this species (a fresh-water cyanobacterium). I did the automatic annotation of the genome using RAST and I got what I believe is a reasonable number of features (around 3200). I'll try to use QUAST to evaluate the position of the contigs for the MIRA assembly. However, this is the first sequenced genome (as far as I am aware) of this species from an Amazonian environment, and genomic comparison analysis shows that the genome I assembled is fairly dissimilar to what RAST believes to be the closest related organism. So, I can't quite use a reference to evaluate the assemblies :( I do see some very high identities when using BLAST to identify the 16S sequences, but there is no reference genome for those closely related species.

9.3 years ago by pedroivo000 ▴ 110

1

I get pretty good results with QUAST even using fairly dissimilar genomes as a reference. Also, what's cool about QUAST is that you can compare multiple assemblies in the same run, and often that will tell you useful things, as well. It should tell you a lot about the state of your assembly.

At any rate, not much to add except to reassure you; we've published good papers on worse assemblies than you're getting out of SPAdes. There's likely not that much more for you to massage out of your sequencing data (you've already done everything that I've ever heard anyone suggest.) If you're still not happy with the assembly and you have some budget left you might try getting it on a MiSeq with a paired-end library (we're pretty strictly MiSeq/NextSeq in my group ever since we sold off our Torrents.)

updated 2.1 years ago by Ram 43k • written 9.3 years ago by crashfrog ▴ 40

0

Thanks, man! I guess I'll have to let it go (please somebody get this song out of my head). The chances of a new sequencing run for this strain happening anytime soon are remote, so I'll move on. Unless some wild tool appears and solves my problems :)

updated 2.1 years ago by Ram 43k • written 9.3 years ago by pedroivo000 ▴ 110