SPAdes did not assemble the genome completely
1
0
11 weeks ago
Flexogore • 0

Hi everyone.

I am trying to assemble a SARS-CoV-2 genome from paired-end (forward and reverse) FASTQ reads. I used SPAdes, and the best result I have managed to get is a FASTA file containing 38 scaffolds. What should I do to get a single, complete sequence in one FASTA record?

SPAdes FASTQ genome assembly FASTA • 440 views
2

It is possible that you simply have far too much data, given how small the SARS-CoV-2 genome is (about 30 kb). You can normalize or downsample your reads and try again, for example with bbnorm.sh from the BBMap suite. Since so many SARS-CoV-2 genomes are available, you may simply want to align your reads to a reference instead of doing an assembly.
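To make the downsampling idea concrete, here is a minimal Python sketch of plain random subsampling of paired reads. Note that this is not what bbnorm.sh does (bbnorm normalizes by k-mer depth); it is a simpler stand-in for illustration. The function names, the `fraction` and `seed` parameters, and the file names in the comments are all hypothetical, and the sketch assumes both files list the same reads in the same order with four lines per record.

```python
import random
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as tuples of 4 lines (header, sequence, '+', qualities)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(fwd_in, rev_in, fwd_out, rev_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return the number of pairs kept.

    Assumes the forward and reverse inputs contain the same reads in the same order.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    for fwd, rev in zip(read_fastq_records(fwd_in), read_fastq_records(rev_in)):
        # One random draw per pair, so mates are always kept or dropped together.
        if rng.random() < fraction:
            fwd_out.writelines(fwd)
            rev_out.writelines(rev)
            kept += 1
    return kept


# Hypothetical usage (file names are placeholders):
# with open("reads_1.fastq") as f1, open("reads_2.fastq") as f2, \
#      open("sub_1.fastq", "w") as o1, open("sub_2.fastq", "w") as o2:
#     subsample_pairs(f1, f2, o1, o2, fraction=0.02)
```

Making a single keep-or-drop decision per pair is the important design point: subsampling each file independently would break the pairing that SPAdes relies on.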

1

Add some long-read sequencing and perform a hybrid assembly, or use a reference sequence and do a reference-guided alignment.

You are unlikely to ever achieve a complete genome with short reads no matter what assembler you use.

0
11 weeks ago
Mensur Dlakic ★ 14k

If your depth of coverage is 1000x or more, it is almost guaranteed that non-random sequencing errors are causing the fragmentation in your assembly. As GenoMax suggested, the way around that is to error-correct the data and downsample to something like 50-100x. I know that throwing away data sounds like a no-no, but it works.
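As a rough illustration of the arithmetic behind that advice, the sketch below estimates mean coverage from read counts and works out what fraction of read pairs to keep to reach a target depth. The function names and the read counts are hypothetical; the genome length of 29,903 bp is that of the Wuhan-Hu-1 reference (NC_045512.2), and the estimate assumes all reads have the same length and all come from the virus.

```python
def mean_coverage(n_read_pairs, read_length, genome_size):
    """Approximate mean depth: total sequenced bases divided by genome length."""
    return 2 * n_read_pairs * read_length / genome_size


def keep_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep to reach the target depth (never above 1)."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    return min(1.0, target_coverage / current_coverage)


# Hypothetical example: 500,000 pairs of 150 bp reads on a 29,903 bp genome.
coverage = mean_coverage(n_read_pairs=500_000, read_length=150, genome_size=29_903)
fraction = keep_fraction(coverage, target_coverage=100)
print(f"estimated coverage: {coverage:.0f}x, keep fraction: {fraction:.4f}")
```

With numbers like these the estimated depth is around 5000x, so only about 2% of the read pairs would be kept to land near 100x.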

If the total number of reads is below 30-40 million, you may want to try a true overlap assembler such as MIRA. In that case you would not need to error-correct the reads because MIRA will do it for you, but it may still be helpful to downsample them. I could possibly give you better advice if you tell us the average sequence coverage in your assembly.

http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

1

Interesting to know that MIRA is still a valid option! In addition to error correction and other assembly programs, OP may try to scaffold their contigs with e.g. RagTag, which would use a SARS-CoV-2 reference sequence as the basis for scaffolding. I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!

1

I'm not too sure MIRA would be able to surpass SPAdes in terms of assembly quality though, especially when only Illumina reads are being used for the assembly!

I obtained a considerably better metagenome assembly with MIRA from Illumina data downsampled to 60x than with SPAdes from the full dataset. Keep in mind that here we have a single-genome assembly, and I think MIRA would work at least as well, if not better. The only catch is that MIRA is very slow and memory-hungry, so it isn't an option for large datasets.