SPAdes Result Comparison
4 weeks ago
Umer ▴ 100

Hi All,

I am working on a de novo genome assembly of a fungal sample. The raw FASTQ data is PE150 at 100x coverage.

I ran de novo genome assembly with SPAdes v4 in 4 different combinations:

  1. SPAdes de novo with the --careful option
  2. SPAdes de novo with the --isolate option
  3. SPAdes with the --trusted-contigs and --careful options
  4. SPAdes with the --trusted-contigs and --isolate options

I used the --isolate option because a previous run with --careful gave a warning that "data has HIGH uniform coverage, recommended to use option --isolate" (see my previous post).

The commands I used for SPAdes, along with the other parameters, are:

$spades -o $spades_out -1 ILL_1.fq.gz -2 ILL_2.fq.gz --careful --threads 14 --memory 240 -k 21 33 55 77 99 111 127
$spades -o $spades_out -1 ILL_1.fq.gz -2 ILL_2.fq.gz --isolate --threads 14 --memory 240 -k 21 33 55 77 99 111 127
$spades -o $spades_out2 -1 ILL_1.fq.gz -2 ILL_2.fq.gz --trusted-contigs $Reference --careful --threads 14 --memory 240 -k 21 33 55 77 99 111 127
$spades -o $spades_out2 -1 ILL_1.fq.gz -2 ILL_2.fq.gz --trusted-contigs $Reference --isolate --threads 14 --memory 240 -k 21 33 55 77 99 111 127

$Reference points to a reference FASTA file, which is a chromosome-level assembly of Fusarium oxysporum. I removed all unplaced contigs and kept only the chromosomes.

After running QUAST on the scaffolds.fasta generated by each of the 4 combinations, I get the following results:

[Figure: QUAST results]

My initial assessment is that the de novo assembly with the --careful option gives the better assembly, as it has fewer contigs and a larger N50 value.

For the reference-guided assembly, I am a bit surprised that it produced more contigs, which I did not expect. (If you can share the reason, that would be helpful.)

I want to know your opinion on this and which method I should use for analyzing the remaining samples.


AFAIK there is no reason not to use --careful pretty much all the time, but if there is low risk of contamination I guess that's where --isolate comes in. Looking at the results, --isolate is clearly helping, but isn't making much difference to your contig lengths.

Generally, whatever gives you the highest N50 is likely to be the best (or at least most immediately _useful_) genome.
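
In case it helps to sanity-check the numbers yourself, here is a rough Python sketch of how N50 is usually defined: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. The function names (`contig_lengths`, `n50`) are just mine for illustration, and the reader assumes a plain, uncompressed FASTA file.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a plain-text FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Example with made-up contig lengths (total 290, half 145)
print(n50([100, 80, 50, 30, 20, 10]))
```

Keep in mind that, as far as I know, QUAST by default ignores contigs shorter than 500 bp, so a value computed over every scaffold may differ slightly from what the report shows.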

I'm not sure what's going on with the ref-guided other than the added information is probably allowing data that could not otherwise be resolved satisfactorily to be incorporated into new contigs. You'll have to do a bit more investigation, but it may well be the case that a number of those new/additional contigs are largely duplications.

A few other angles/questions:

  • Is the genome haploid?
  • Are there any extra-chromosomal DNA sources in the reference that might be allowing assembly of otherwise hard-to-resolve contigs?

100X coverage isn't crazy-high, but sometimes it can be worth experimenting with downsampling the coverage too.


Hi, thank you for your helpful insight.

  • Yes, the genome is haploid.
  • The reference I used only contains 15 chromosomes: 11 core chromosomes and 4 accessory chromosomes. (I removed all other unplaced contigs from the FASTA file.)

Can you shed some light on downsampling the coverage? How would I do that?


Since it's haploid, you might also want to try shovill (https://github.com/tseemann/shovill).

It's intended for bacteria, but it does say it will work for other small microbes as long as they're haploid. Your genome might be too big, but worth a try.

Downsampling is basically just picking a subset of reads at random to reduce the overall coverage. For bacterial genomes, 30X is the pretty widely accepted 'norm'. There are a few tools that can do it or you can write something yourself. Have a look on this forum for other posts about downsampling.
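
If you do want to write something yourself, a minimal sketch in Python could look like the one below. It assumes standard 4-line FASTQ records with both mate files in the same order, and it keeps each read pair with a fixed probability so that mates stay in sync. The function names and the output file names are made up for illustration. The fraction to keep is roughly target coverage divided by current coverage, so going from 100X to 30X would be about 0.3.

```python
import gzip
import random


def open_fastq(path, mode="rt"):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def read_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes 4-line records)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction`; return the pairs kept."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = 0
    with open_fastq(r1_in) as f1, open_fastq(r2_in) as f2, \
            open_fastq(r1_out, "wt") as o1, open_fastq(r2_out, "wt") as o2:
        # Walk both files in lockstep so R1 and R2 of a pair are kept together
        for rec1, rec2 in zip(read_records(f1), read_records(f2)):
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept


# Hypothetical usage: roughly 100X -> 30X
# downsample_pairs("ILL_1.fq.gz", "ILL_2.fq.gz",
#                  "ILL_1.sub.fq.gz", "ILL_2.sub.fq.gz", fraction=0.3)
```

Because each pair is kept independently, the final coverage will only be approximately the target, which is normally fine for this purpose. Trying a couple of different seeds is a cheap way to check that the assembly does not depend heavily on which subset was drawn.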


I'd have a quick look at BUSCO scores before judging these genomes.


Hi,

How long did it take to run the whole process with this command?

$spades -o $spades_out -1 ILL_1.fq.gz -2 ILL_2.fq.gz --isolate --threads 14 --memory 240 -k 21 33 55 77 99 111 127

Regards,


I didn't keep the log files, but as far as I remember it took around 5 to 6 hours.

Recently I generated a hybrid assembly using SPAdes with 10 threads, once with the --isolate option and once with --careful.

The Illumina data was the same as in the command above (ILL11), plus Nanopore data.

With --isolate it took 485 minutes (~8 hr) at 10 threads.

With --careful it took 660 minutes (~11 hr) at 10 threads.
