Question

Should I consider contigs.fa or scaffolds.fa from SPAdes output for downstream analyses?

4.9 years ago

bioinforesearchquestions ▴ 370

Hi,

I am currently working on bacterial genome sequencing. This follows on from my original post (C: Kmer selection for bacterial WGS denovo assembly using SPAdes or SOAP-denovo). I ran two analyses: one with the original trimmed reads and one with downsampled reads (using BBNorm).

Do I need to use

Downsampled dataset: ~35 million reads

    Genome assembly using SPAdes assembler
    SPAdes Command:
    python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful

    Genome Evaluation using QUAST:
    python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output

Original trimmed dataset: ~ 430 million reads

    Genome assembly using SPAdes assembler
    SPAdes Command:
    python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful

    Genome Evaluation using QUAST:
    python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output

Which final SPAdes output file (contigs.fasta or scaffolds.fasta) should be used for downstream analyses?
Should I consider the K55 final_contigs/final_scaffolds, or the contigs/scaffolds fasta files in the main output directory (image below)?

Using scaffolds.fasta for both the downsampled and the original trimmed reads, I evaluated the assemblies with QUAST. How should I interpret the results in these tables?

Assembly SPAdes contigs QUAST • 6.9k views

ADD COMMENT • link updated 4.9 years ago by h.mon 35k • written 4.9 years ago by bioinforesearchquestions ▴ 370

score 1 · Answer 1 · 2019-06-17

4.9 years ago

h.mon 35k

Which final SPAdes output file (contigs.fasta or scaffolds.fasta) should be used for downstream analyses?

As you didn't use mate-pairs (at least, you didn't say you had mate-pair libraries), the contigs.fasta and the scaffolds.fasta should be almost the same. For most analyses, you should use the scaffolds.fasta.

Should I consider the K55 final_contigs/final_scaffolds, or the contigs/scaffolds fasta files in the main output directory (image below)?

No, the final assembly is in the base output folder, not in any subfolder.

Using scaffolds.fasta for both the downsampled and the original trimmed reads, I evaluated the assemblies with QUAST. How should I interpret the results in these tables?

You had excessive coverage (I would guess almost 10,000x), and assemblers cannot deal with this volume of data: there will be too many errors in the reads, which create too many unresolvable bubbles in the de Bruijn graph. Sub-sampling reduced the dataset to a more manageable size and eliminated most sequencing errors.
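As a rough check on the coverage guess above, depth can be estimated as reads × read length ÷ genome size. The sketch below uses the read counts from the question; the 100 bp read length and the 5 Mb genome size are assumptions made only for illustration, since the thread states neither.

```python
def estimated_coverage(n_reads, read_length, genome_size):
    """Approximate sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Assumptions (not given in the thread): read length and genome size.
ASSUMED_READ_LENGTH = 100        # bp, hypothetical
ASSUMED_GENOME_SIZE = 5_000_000  # bp, hypothetical bacterial genome

# Read counts quoted in the question.
full_depth = estimated_coverage(430_000_000, ASSUMED_READ_LENGTH, ASSUMED_GENOME_SIZE)
down_depth = estimated_coverage(35_000_000, ASSUMED_READ_LENGTH, ASSUMED_GENOME_SIZE)

print(f"original trimmed reads: ~{full_depth:,.0f}x")
print(f"downsampled reads:      ~{down_depth:,.0f}x")
```

Under these assumptions the full dataset comes out in the high thousands, the same order of magnitude as the estimate above. Longer reads or a smaller genome would push it higher.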

~~Possibly, digital normalization would result in a better assembly than just sub-sampling.~~

edit: as genomax pointed out, you probably have done digital normalization.

ADD COMMENT • link 4.9 years ago by h.mon 35k

The downsampled results refer to normalized data, since OP used bbnorm.sh, as most of the numbers reflect. I don't know why N's are higher in the normalized data.

bioinforesearchquestions : Were these reads completely cleaned of artifacts before being normalized?

I wonder if a further reduction in data would help. If you have the time, you may want to try it.

ADD REPLY • link 4.9 years ago by GenoMax 141k

I don't know why N's are higher in normalized data.

Probably because, at the scaffolding step, some contigs could be joined using the paired-end reads; SPAdes then fills the gaps between them with a small number of Ns. The full-data assembly, however, is so fragmented that SPAdes wasn't able to link contigs into scaffolds. My guess is that this is due to reads mapping to multiple erroneous contigs.
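To see where those Ns sit, one option is to count the runs of N in each scaffold. The following is a minimal sketch with a hand-rolled FASTA reader; the function names and the toy records are made up for illustration and are not part of SPAdes or QUAST.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_summary(lines):
    """Map each sequence name to (number of N runs, total N bases)."""
    summary = {}
    for name, seq in read_fasta(lines):
        runs = re.findall(r"[Nn]+", seq)
        summary[name] = (len(runs), sum(len(run) for run in runs))
    return summary


# Toy records (hypothetical); scaffold_1 has two gaps, scaffold_2 has none.
toy_lines = [
    ">scaffold_1",
    "ACGT" + "N" * 10 + "ACGT",
    "AC" + "N" * 2 + "GT",
    ">scaffold_2",
    "ACGTACGT",
]
toy_summary = gap_summary(toy_lines)
print(toy_summary)
```

For a real assembly, the same function can be given an open file handle, for example `with open("spades_out/scaffolds.fasta") as handle: gap_summary(handle)`.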

ADD REPLY • link 4.9 years ago by h.mon 35k

Hi genomax,

As requested, please find the quality reports for the raw and trimmed reads used for the assembly step.

For the BBNorm step (bbnorm.sh in=reads.fq out=normalized.fq target=1000 min=30), I also used the trimmed reads to produce the normalized, downsampled dataset.

Raw_FASTQC

Trimmed FASTQC

ADD REPLY • link 4.9 years ago by bioinforesearchquestions ▴ 370

Hi @Genomax/h.mon,

I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I read that this step improves the assembly.

In general, which parameters in the QUAST report should I consider when judging whether the assembly is good enough to proceed to downstream analyses?

ADD REPLY • link 4.9 years ago by bioinforesearchquestions ▴ 370

Hi @Genomax/h.mon,

I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I then evaluated the remaining_contigs.fasta from AlignGraph using QUAST.

How should I interpret this part of the result?

ADD REPLY • link 4.9 years ago by bioinforesearchquestions ▴ 370

bioinforesearchquestions : I have not used the tools you are referring to above, so I can't directly assist.

In a Biostars Slack chat with @h.mon, we agreed that you are unlikely to get a single closed genome with the data you have. If that is your ultimate goal, then you may want to look at alternate sequencing technologies to supplement your Illumina data.

ADD REPLY • link 4.9 years ago by GenoMax 141k

With regard to the metrics for the downsampled and the entire dataset, which contigs can be used for downstream analyses?

I ask because the N50 for the downsampled dataset is 212,867, whereas for the entire dataset it is 3,574.

Is a higher or a lower N50 better?

ADD REPLY • link 4.8 years ago by bioinforesearchquestions ▴ 370

https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50

N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.

A higher N50 is the better result.
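The definition can be made concrete with a short sketch: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative only; QUAST reports N50 directly, so this is just to show what the number means.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy assembly (hypothetical lengths): total 32, so half is 16.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))  # prints 8, since 8 + 8 already reaches 16
```

Fewer, longer sequences push N50 up, which is why the far less fragmented downsampled assembly scores so much higher.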

ADD REPLY • link 4.8 years ago by GenoMax 141k