Poor Output from SPAdes with High-Coverage Input Data
6.0 years ago
callamartyn ▴ 10

Hi all,

I am testing different methods for assembling viral genomes from Illumina data and am getting some surprising results from SPAdes. I mapped the reads to a reference genome in Geneious for comparison and can see that they cover 99.1% of the 10 kb genome at an average depth of 11,258 (I sequenced very deeply and enriched the library for viral reads). So I assumed there would be more than enough data for SPAdes to output the entire genome.

However, when I run SPAdes (in paired-end mode), two unusual things happen. First, it is not able to assemble the entire viral genome, or even any substantial contigs of it. When I map the scaffolds back to the reference genome, they cover only about 46% of it. I can bring this up to 85% by using the "trusted contig" option, but this is still well below the 99% I get from mapping all the reads directly. Does anyone have an idea why this might be the case, or where I should start looking for the problem? I know SPAdes works for many people, and the data I am inputting seems like it should be more than sufficient to get back a full genome.

Second, when I map the scaffolds back to the reference I can see that many of them overlap with each other substantially. Can anyone explain why they wouldn't be joined into a larger contig/scaffold? And are there any options I can add in SPAdes to join them?

Would appreciate any direction anyone can suggest to figure out what is going wrong. Thanks in advance!

assembly genome • 2.4k views

Not answering your question, but you may want to give tadpole.sh from the BBMap suite a try. It works well with viral genomes. Since you have way too much coverage, consider normalizing your data using bbnorm.sh (see guide above).

Thanks so much, I will definitely try the normalization!

Is this DNAseq or RNAseq?

different methods for assembling viral genomes

Looks like DNAseq, unless these are RNA viruses.

RNAseq: viral RNA that was reverse-transcribed and prepped with a Nextera kit.

Reduce your coverage. De Bruijn graph assemblers can choke on very high coverage. You can randomly subsample your FASTQ files with several tools.
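If you would rather see what the random subsampling does than reach for a dedicated tool, here is a minimal sketch in plain Python. It is only an illustration under some assumptions: the reads are in two uncompressed, 4-line-per-record FASTQ files with mates in the same order, and the function names (`read_fastq`, `subsample_pairs`) and file paths are made up for this example. It uses reservoir sampling, so the input is read once and only the kept pairs are held in memory.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples of lines (assumes 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, n_pairs, seed=0):
    """Write n_pairs randomly chosen read pairs; return how many were kept.

    Mates are drawn together so R1 and R2 stay in sync.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(r1_path) as r1, open(r2_path) as r2:
        pairs = zip(read_fastq(r1), read_fastq(r2))
        for i, pair in enumerate(pairs):
            if i < n_pairs:
                # Fill the reservoir with the first n_pairs pairs.
                reservoir.append(pair)
            else:
                # Replace a kept pair with probability n_pairs / (i + 1).
                j = rng.randint(0, i)
                if j < n_pairs:
                    reservoir[j] = pair
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in reservoir:
            o1.writelines(rec1)
            o2.writelines(rec2)
    return len(reservoir)
```

Something like `subsample_pairs("R1.fastq", "R2.fastq", "sub_R1.fastq", "sub_R2.fastq", 250_000)` would then keep 250,000 pairs. Changing `seed` gives a different random subset, which is handy if you want to check that an assembly result is not specific to one draw.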

Thanks so much! I tried a few different amounts of reads and got it to work. For anyone else who encounters this problem, I tried 20,000, 500,000, and 1 million reads that had already undergone host-subtraction. For my particular data 500,000 produced the best assembly (for a 10 kb genome) and it was definitely deteriorating by 1 million.

What coverage was that equivalent to in the end?
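The thread does not report that number, but a rough upper bound can be estimated as reads × read length / genome length. The sketch below is only an illustration: the 10 kb genome size and the read counts come from the post above, while the 150 bp read length and the `on_target_fraction` parameter are assumptions (the actual read length is not given, and not every read surviving host subtraction will necessarily be viral). It also treats each "read" as a single read rather than a pair.

```python
def estimated_depth(n_reads, read_length, genome_length, on_target_fraction=1.0):
    """Approximate mean depth: total on-target bases divided by genome length."""
    return n_reads * read_length * on_target_fraction / genome_length


GENOME_LENGTH = 10_000  # 10 kb genome, from the original post
READ_LENGTH = 150       # ASSUMPTION: read length is not stated in the thread

for n_reads in (20_000, 500_000, 1_000_000):
    depth = estimated_depth(n_reads, READ_LENGTH, GENOME_LENGTH)
    print(f"{n_reads:>9,} reads -> ~{depth:,.0f}x (upper bound)")
```

With the real read length and the fraction of reads that actually map to the virus plugged in, the same formula gives the depth each subsample corresponds to.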

6.0 years ago
h.mon 35k

For RNAseq of RNA viruses, I have had good results (meaning complete viral genomes) with Trinity + CAP3. The initial Trinity assembly is very often fragmented, but the second assembly step with CAP3 fixes this.

There are several quality control / filtering steps that may increase the quality of the assembly. Have a look at the metaViC pipeline (announced here: Tools for viral metagenomics profiling and abundance estimation using BAM file) for ideas, or use it directly. In particular, I would recommend very aggressive adapter and quality trimming, as there is plenty of coverage to spare.

Thanks so much! I finally got SPAdes to work, but I am curious about metaViC and will try it as well!
