Hey guys,
I need some help improving a de novo assembly of a prokaryote genome made using Ion Torrent reads (mean length = 230 bp, mean quality Q = 29). The assembly was originally made with MIRA 4, resulting in 519 contigs (~7.8 Mb). The DNA sample was contaminated with a symbiont prokaryote, so the 7.8 Mb corresponded to about 3 Mb of my genome of interest and the rest to the contaminant genome. I was able to filter out the contigs from the contaminant genome using BLAST of all contigs against a closely related species with a fully sequenced genome. In the end, I got a "final" assembly of 165 contigs with a total length of 2.85 Mb (N50 27,895 bp).
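For anyone wanting to reproduce the contig filtering step, here is a minimal Python sketch of one way it could be done. It assumes the BLAST results were saved in the default tabular format (-outfmt 6), where column 3 is percent identity and column 4 is alignment length. The function name and the two thresholds are hypothetical and not taken from the post. Whether the contigs it returns are the ones to keep or the ones to discard depends on which genome the BLAST database was built from.

```python
def contigs_with_hits(blast_lines, min_identity=80.0, min_aln_len=500):
    """Return the set of query (contig) IDs with at least one hit that
    passes both thresholds. Expects BLAST tabular output (-outfmt 6).
    The default thresholds are illustrative assumptions only."""
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        identity = float(fields[2])  # column 3: percent identity
        aln_len = int(fields[3])     # column 4: alignment length
        if identity >= min_identity and aln_len >= min_aln_len:
            hits.add(query_id)
    return hits


# Toy input, made up for illustration (not real BLAST output from this project).
example = [
    "contig_1\tref_chr\t91.5\t12000\t900\t20\t1\t12000\t5000\t17000\t0.0\t15000",
    "contig_2\tref_chr\t72.0\t300\t80\t4\t1\t300\t100\t400\t1e-20\t120",
]
print(sorted(contigs_with_hits(example)))  # ['contig_1']
```

With real data the lines would come from the BLAST output file, and the thresholds would need tuning against contigs whose origin is already known.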
I would like to improve the assembly by increasing the N50 value or extending the contigs in some way, but I am stuck now. This is my (not so much) progress so far:
I mapped the Ion Torrent reads to the contaminant genome using bwa mem (all default parameters) and used samtools to get only the unmapped reads (the reads that would belong to the genome I want to assemble)
Used SPAdes to reassemble the unmapped reads using the 165 contigs as trusted contigs for gap closure, repeat resolution and graph construction (--trusted-contigs option)
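To make the "unmapped reads" step concrete: in SAM output, a read is unmapped when bit 0x4 of its FLAG field is set, which is what samtools view -f 4 selects. The Python sketch below applies the same test to plain SAM lines. The function name and the toy records are made up for illustration, and this is not necessarily the exact command used above.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_names(sam_lines):
    """Return names of reads whose FLAG has the unmapped bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])  # column 2: bitwise FLAG
        if flag & UNMAPPED:
            names.append(fields[0])
    return names


# Toy SAM records, made up for illustration.
example = [
    "@SQ\tSN:contaminant\tLN:5000000",
    "read_1\t0\tcontaminant\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "read_3\t16\tcontaminant\t900\t60\t4M\t*\t0\t0\tGGCA\tIIII",
]
print(unmapped_read_names(example))  # ['read_2']
```

Counting these names against the total number of reads gives a quick estimate of how much coverage is left for the genome of interest.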
However, SPAdes keeps crashing when trying to assemble the unmapped reads. I believe the read coverage is too low now to do anything. Using all the reads from the sequencing run during the assembly with SPAdes generates 8447 contigs, which is not really an improvement over the first MIRA assembly.
So, after this long explanation of my problems, here comes the question:
What do you guys usually do to improve a primary assembly in a situation like this? I am looking for a tool that could use the reads to extend the contigs or scaffold them, or any sort of strategy that could be used in this case. I am trying to use the SPAdes error correction tool to improve the quality of the reads so I can remap them to the contaminant genome, hoping to have more unmapped reads to redo the SPAdes assembly step. Would the error correction be a good strategy in this case? Also, is BWA a good mapping tool for Ion Torrent reads?
Sorry if the questions seem dumb, I am a newbie in the genome assembly world. At least now I don't have to pray for the PCR gods to make something work.
Thanks for the answer! My final objective is to publish a draft genome for this species (a freshwater cyanobacterium). I did the automatic annotation of the genome using RAST and I got what I believe is a reasonable number of features (around 3200). I'll try to use QUAST to evaluate the position of the contigs for the MIRA assembly. However, this is the first sequenced genome (as far as I am aware) of this species from an Amazonian environment, and genomic comparison analysis shows that the genome I assembled is fairly dissimilar to what RAST believes to be the closest related organism. So, I can't quite use a reference to evaluate the assemblies :( I do see some very high identities when using BLAST to identify the 16S sequences, but there is no reference genome for those closely related species.
I get pretty good results with QUAST even using fairly dissimilar genomes as a reference. Also, what's cool about QUAST is that you can compare multiple assemblies in the same run, and often that will tell you useful things, as well. It should tell you a lot about the state of your assembly.
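N50 is one of the numbers QUAST reports for each assembly, and it is easy to check by hand. Here is a minimal Python sketch using the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size). The contig lengths are toy values made up for illustration, not numbers from this thread.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum, going from
    longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Toy contig lengths for two hypothetical assemblies of equal total size.
assemblies = {
    "assembly_a": [100, 80, 50, 30, 20],
    "assembly_b": [60, 60, 60, 50, 50],
}
for name, lengths in assemblies.items():
    print(name, len(lengths), sum(lengths), n50(lengths))
```

Both toy assemblies have the same total length and the same number of contigs, but assembly_a has the higher N50 because more of its sequence sits in long contigs. That is the kind of difference a side-by-side comparison makes visible.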
At any rate, not much to add except to reassure you; we've published good papers on worse assemblies than you're getting out of SPAdes. There's likely not that much more for you to massage out of your sequencing data (you've already done everything that I've ever heard anyone suggest.) If you're still not happy with the assembly and you have some budget left you might try getting it on a MiSeq with a paired-end library (we're pretty strictly MiSeq/NextSeq in my group ever since we sold off our Torrents.)
Thanks, man! I guess I'll have to let it go (please somebody get this song out of my head). The chances of a new sequencing run for this strain happening anytime soon are remote, so I'll move on. Unless some wild tool appears and solves my problems :)