Hi all,
I am testing out different methods for assembling viral genomes from Illumina data and am getting some surprising results from SPAdes. For comparison, I mapped the reads to a reference genome in Geneious and can see that they cover 99.1% of the 10 kb genome at an average depth of 11,258× (I sequenced very deeply and enriched the library for viral reads). So I assumed there would be more than enough data for SPAdes to output the entire genome.
However, when I run SPAdes (in paired-end mode) two unusual things happen. First, it is not able to assemble the entire viral genome, or even any substantial contigs of it. When I map the scaffolds back to the reference genome, I only get about 46% coverage. I can bring this up to 85% by using the --trusted-contigs option, but this is still well below the 99% I get from mapping all the reads directly. Does anyone have an idea why this might be the case, or where I should start looking for the problem? I know SPAdes works for many people, and the data I am inputting seems like it should be more than sufficient to get back a full genome.
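For reference, my SPAdes calls look roughly like this (file names are placeholders, and the trusted contigs are just the draft contigs I feed back in):

    # plain paired-end run
    spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -o spades_default
    # same run, but supplying draft contigs via --trusted-contigs
    spades.py -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz --trusted-contigs draft_contigs.fasta -o spades_trusted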
Second, when I map the scaffolds back to the reference, I can see that many of them overlap each other substantially. Can anyone explain why they wouldn't be joined into a larger contig/scaffold? And are there any options I can pass to SPAdes to join them?
Would appreciate any direction anyone can suggest to figure out what is going wrong. Thanks in advance!
Not answering your question, but you may want to give tadpole.sh from the BBMap suite a try; it works well with viral genomes. Since you have far more coverage than you need, also consider normalizing your data with bbnorm.sh (see the guide above).
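Something along these lines, with the file names and the depth target adjusted to your data (these values are just examples):

    # normalize to ~100x depth, discarding reads below depth 5
    bbnorm.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz out1=norm_R1.fq.gz out2=norm_R2.fq.gz target=100 min=5
    # assemble the normalized reads (larger k= values can help for a small viral genome)
    tadpole.sh in1=norm_R1.fq.gz in2=norm_R2.fq.gz out=tadpole_contigs.fa mode=contig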
Thanks so much, I will definitely try the normalization!
Is this DNAseq or RNAseq?
Looks like DNAseq, unless these are RNA viruses.
RNASeq; viral RNA that was reverse transcribed and prepped with a Nextera kit
Reduce your coverage. De Bruijn graph assemblers can choke on very high coverage. You can randomly subsample your fastqs with several tools.
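For example, with seqtk (the seed and read count here are just illustrative; reformat.sh from BBMap can do the same):

    # use the same seed (-s) for R1 and R2 so the pairs stay in sync
    seqtk sample -s100 reads_R1.fastq.gz 500000 > sub_R1.fastq
    seqtk sample -s100 reads_R2.fastq.gz 500000 > sub_R2.fastq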
Thanks so much! I tried a few different numbers of reads and got it to work. For anyone else who encounters this problem, I tried 20,000, 500,000, and 1 million reads that had already undergone host subtraction. For my particular data, 500,000 produced the best assembly (for a 10 kb genome), and the assembly was clearly deteriorating by 1 million.
What coverage was that equivalent to in the end?