Question

SPAdes did not assemble the genome completely

2.7 years ago

Flexogore ▴ 10

Hi everyone.

I am trying to assemble a SARS-CoV-2 genome from paired-end (forward and reverse) FASTQ reads. I used SPAdes, and the best result I have obtained so far is a FASTA file containing 38 scaffolds. What should I do to get a single, complete FASTA sequence?

SPAdes FASTQ genome assembly FASTA • 2.6k views

ADD COMMENT • link updated 2.4 years ago by GenoMax 141k • written 2.7 years ago by Flexogore ▴ 10

It is possible that you simply have far too much data for such a small genome (SARS-CoV-2 is roughly 30 kb). You can normalize or downsample your data and try again, for example with bbnorm.sh from the BBMap suite. Since so many SARS-CoV-2 genomes are available, you may simply want to align your reads to a reference instead of doing an assembly.
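As a rough illustration of the downsampling idea, here is a minimal Python sketch that randomly keeps a fraction of read pairs while keeping forward and reverse mates in sync. Note that this is plain random subsampling, not the k-mer depth normalization that bbnorm.sh performs. The function names, the `keep_fraction` parameter, and the assumption of unwrapped 4-line FASTQ records are illustrative choices, not part of any tool mentioned in this thread.

```python
import random
from typing import Iterator, TextIO, Tuple

# header, sequence, separator, quality
FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield FASTQ records, assuming exactly 4 lines per read (no wrapping)."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:  # empty header line means end of file
            return
        yield (lines[0], lines[1], lines[2], lines[3])


def downsample_pairs(
    fwd_in: TextIO,
    rev_in: TextIO,
    fwd_out: TextIO,
    rev_out: TextIO,
    keep_fraction: float,
    seed: int = 0,
) -> int:
    """Write a random subset of read pairs; return the number of pairs kept.

    One random draw is made per pair, so both mates are always kept or
    dropped together and the two output files stay in the same order.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = 0
    for rec1, rec2 in zip(read_fastq(fwd_in), read_fastq(rev_in)):
        if rng.random() < keep_fraction:
            fwd_out.write("\n".join(rec1) + "\n")
            rev_out.write("\n".join(rec2) + "\n")
            kept += 1
    return kept
```

For gzipped input, the handles could be opened with `gzip.open(path, "rt")` from the standard library; the sketch streams records, so it does not need to hold the whole dataset in memory.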

ADD REPLY • link 2.7 years ago by GenoMax 141k

Add some long-read sequencing and perform a hybrid assembly, or use a reference sequence and do a reference-guided alignment.

You are unlikely to ever achieve a complete genome with short reads no matter what assembler you use.

ADD REPLY • link 2.7 years ago by Joe 21k

score 2 · Answer 1 · 2021-08-04

2.7 years ago

Mensur Dlakic ★ 27k

If your depth of coverage is 1000x or more, it is almost guaranteed that non-random sequencing errors are causing the fragmentation in your assembly. As GenoMax suggested, the way around that is to error-correct the data and downsample to something like 50-100x. I know that throwing away data sounds like a no-no, but it works.
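To make the coverage arithmetic concrete, here is a small Python sketch that estimates depth of coverage from read count and read length, then computes what fraction of reads to keep to reach a target depth. The genome size is the length of the SARS-CoV-2 reference NC_045512.2 (29,903 bp); the read count and read length in the example are hypothetical, not taken from the original question.

```python
SARS_COV_2_GENOME_BP = 29_903  # length of reference NC_045512.2


def estimated_coverage(
    n_reads: int,
    mean_read_len: float,
    genome_size: int = SARS_COV_2_GENOME_BP,
) -> float:
    """Average depth = total sequenced bases / genome length."""
    return n_reads * mean_read_len / genome_size


def keep_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to go from current_cov to target_cov."""
    if current_cov <= 0:
        raise ValueError("current_cov must be positive")
    # Never ask for more than all of the reads.
    return min(1.0, target_cov / current_cov)


# Hypothetical run: 1,000,000 read pairs of 2 x 150 bp = 2,000,000 reads.
cov = estimated_coverage(n_reads=2_000_000, mean_read_len=150)
frac = keep_fraction(cov, target_cov=100)
print(f"estimated coverage: {cov:.0f}x, keep fraction for 100x: {frac:.4f}")
```

With these made-up numbers the estimated depth is roughly 10,000x, so only about 1% of the reads would be needed to reach 100x. This ignores unmapped or host-derived reads, which would lower the effective coverage of the viral genome.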

If the total number of reads is below 30-40 million, you may want to try a true overlap assembler such as MIRA. In that case you would not need to error-correct the reads because MIRA will do it for you, but it may still be helpful to downsample them. I could possibly give you better advice if you told us the average sequence coverage in your assembly.

http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

ADD COMMENT • link 2.7 years ago by Mensur Dlakic ★ 27k

Interesting to know that MIRA is still a valid option! In addition to error correction and other assembly programs, OP may try to scaffold their contigs with e.g. RagTag, which would take a SARS-CoV-2 reference sequence as the basis for scaffolding. I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!

ADD REPLY • link 2.7 years ago by ponganta ▴ 590

"I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!"

I obtained a considerably better metagenome assembly with MIRA from Illumina data downsampled to 60x than with SPAdes from the full dataset. Keep in mind that here we have a single-genome assembly, and I think MIRA would work at least as well, if not better. The only catch is that MIRA is very slow and memory-hungry, so it isn't an option for large datasets.

ADD REPLY • link 2.7 years ago by Mensur Dlakic ★ 27k

score 2 · Answer 2 · 2021-11-09

2.4 years ago

anton ▴ 70

The answer is simple: coronaSPAdes, which is part of the SPAdes 3.15 release series.
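For readers who want to try it, here is a hedged Python sketch that builds a coronaSPAdes command line for paired-end reads. It assumes the `coronaspades.py` launcher shipped with SPAdes 3.15 is on your PATH and uses the standard SPAdes options `-1`, `-2`, `-o` and `-t`. The file names, output directory, and helper function names are placeholders invented for this example.

```python
import shutil
import subprocess
from typing import List


def build_coronaspades_cmd(
    fwd: str, rev: str, out_dir: str, threads: int = 4
) -> List[str]:
    """Return the argument list for a paired-end coronaSPAdes run."""
    return [
        "coronaspades.py",
        "-1", fwd,            # forward reads
        "-2", rev,            # reverse reads
        "-o", out_dir,        # output directory
        "-t", str(threads),   # number of threads
    ]


def run_coronaspades(cmd: List[str]) -> None:
    """Run the command, failing early if the launcher is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


# Placeholder file names; replace with your own reads before running.
cmd = build_coronaspades_cmd(
    "reads_R1.fastq.gz", "reads_R2.fastq.gz", "coronaspades_out"
)
print(" ".join(cmd))
```

Calling `run_coronaspades(cmd)` would start the assembly; it is left out of the sketch so that the example only prints the command.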

ADD COMMENT • link 2.4 years ago by anton ▴ 70

Thank you for making us aware of coronaSPAdes. Does one need to downsample the data or will the program handle an excess of coverage internally?

ADD REPLY • link 2.4 years ago by GenoMax 141k