Question: Should I consider contigs.fa or scaffolds.fa from SPAdes output for downstream analyses?
4 weeks ago by
United States
bioinforesearchquestions240 wrote:

Hi,

I am currently working on bacterial genome sequencing. This follows up on my original post (C: Kmer selection for bacterial WGS denovo assembly using SPAdes or SOAP-denovo). I ran two different analyses: one with the original trimmed reads and the other with downsampled reads (using BBNorm).

Do I need to use

Downsampled dataset: ~35 million reads

    Genome assembly using SPAdes assembler
    SPAdes Command:
    python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful

    Genome Evaluation using QUAST:
    python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output

spades-output

Original trimmed dataset: ~ 430 million reads

    Genome assembly using SPAdes assembler
    SPAdes Command:
    python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful

    Genome Evaluation using QUAST:
    python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output

  • Which final SPAdes output file (contigs.fasta or scaffolds.fasta) should be used for downstream analyses?

  • Should I use the K55 final_contigs/final_scaffolds files, or the contigs/scaffolds FASTA files in the main output directory (image below)?

Using scaffolds.fasta for both the downsampled and the original trimmed reads, I evaluated the assemblies with QUAST. How should I interpret the results in these tables?

Quast-output

contigs quast spades assembly • 217 views
modified 4 weeks ago by h.mon26k • written 4 weeks ago by bioinforesearchquestions240
4 weeks ago by
h.mon26k
Brazil
h.mon26k wrote:

Which final SPAdes output file (contigs.fasta or scaffolds.fasta) should be used for downstream analyses?

As you didn't use mate-pairs (at least, you didn't say you had mate-pair libraries), the contigs.fasta and the scaffolds.fasta should be almost the same. For most analyses, you should use the scaffolds.fasta.

Should I use the K55 final_contigs/final_scaffolds files, or the contigs/scaffolds FASTA files in the main output directory (image below)?

No, the final assembly is at the base output folder, not in any subfolder.

Using scaffolds.fasta for both the downsampled and the original trimmed reads, I evaluated the assemblies with QUAST. How should I interpret the results in these tables?

You had excessive coverage (I guess almost 10,000x), and assemblers cannot deal with this volume of data: there will be too many errors in the reads, which will create too many unresolvable bubbles in the de Bruijn graph. Sub-sampling reduced the dataset to a more manageable size and eliminated most sequencing errors.
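For readers who want to check the coverage guess, here is a minimal sketch of the usual depth estimate (total sequenced bases divided by genome size). The read counts come from the question; the read length and genome size are not given anywhere in the thread, so the values below are hypothetical placeholders, and the sketch assumes the quoted counts are individual reads rather than read pairs.

```python
def estimate_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average sequencing depth = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


# ASSUMPTIONS (not stated in the thread): replace with your own values.
READ_LENGTH_BP = 100          # hypothetical read length
GENOME_SIZE_BP = 5_000_000    # hypothetical bacterial genome size

# Read counts as reported in the question.
full = estimate_coverage(430_000_000, READ_LENGTH_BP, GENOME_SIZE_BP)
down = estimate_coverage(35_000_000, READ_LENGTH_BP, GENOME_SIZE_BP)

print(f"original trimmed dataset: ~{full:,.0f}x")
print(f"downsampled dataset:      ~{down:,.0f}x")
```

Under these assumed values the original dataset comes out in the thousands-fold range, which is the same order of magnitude as the guess above; the real figure depends on the actual read length and genome size.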

Possibly, digital normalization would result in a better assembly than just sub-sampling.

edit: as genomax pointed out, you probably have done digital normalization.

modified 4 weeks ago • written 4 weeks ago by h.mon26k

The downsampled results refer to normalized data, since OP used bbnorm.sh, as reflected by most of the numbers. I don't know why the number of N's is higher in the normalized data.

bioinforesearchquestions : Were these reads completely cleaned of artifacts before being normalized?

I wonder if a further reduction in data would help. If you have the time to do it you may want to try.

modified 4 weeks ago • written 4 weeks ago by genomax69k

I don't know why N's are higher in normalized data.

Probably because, at the scaffolding step, some contigs could be joined using the paired-end reads; SPAdes then fills the gaps between them with a small number of Ns. The full-data assembly, however, is so fragmented that SPAdes wasn't able to link contigs into scaffolds. My guess is that this is due to reads mapping to multiple erroneous contigs.
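To see where the Ns sit, one could count the runs of N in each scaffold. The sketch below is only an illustration: the parser is a hand-rolled minimal FASTA reader that expects well-formed input, the function names are made up for this example, and the two records are toy data, not sequences from this assembly.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a well-formed FASTA file."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_summary(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each record name to (number of N runs, total N bases)."""
    summary = {}
    for name, seq in read_fasta(lines):
        runs = re.findall(r"[Nn]+", seq)
        summary[name] = (len(runs), sum(len(run) for run in runs))
    return summary


# Toy records: scaffold_1 has one gap of 10 Ns, scaffold_2 has none.
example = [
    ">scaffold_1",
    "ACGTACGT" + "N" * 10 + "ACGT",
    ">scaffold_2",
    "ACGTACGT",
]
print(gap_summary(example))
```

For a real file, the same function could be called as `gap_summary(open("scaffolds.fasta"))`. A scaffold with no N runs is effectively a contig, so an assembly with few joined scaffolds will also show few Ns.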

written 4 weeks ago by h.mon26k

Hi genomax,

As requested, please find below the quality reports for the raw and trimmed reads used for the assembly step.

For the BBNorm step as well (bbnorm.sh in=reads.fq out=normalized.fq target=1000 min=30), I used the trimmed reads to produce the normalized, downsampled dataset.

Raw_FASTQC raw-fastqc-reports

Trimmed FASTQC trimmed-fastqc-reports

modified 4 weeks ago • written 4 weeks ago by bioinforesearchquestions240

Hi @Genomax/h.mon,

I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I read that this step can improve the assembly.

In general, which metrics in the QUAST report should be considered when judging whether the assembly is good enough to proceed to downstream analyses?

modified 4 weeks ago • written 4 weeks ago by bioinforesearchquestions240

Hi @Genomax/h.mon,

I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I then used QUAST to evaluate the remaining_contigs.fasta from AlignGraph.

Quast-results-for-aligngraph-with-reference

How should I interpret this part of the result?

modified 4 weeks ago • written 4 weeks ago by bioinforesearchquestions240

bioinforesearchquestions : I have not used the tools you are referring to above so can't directly assist.

In biostar slack chat with @h.mon we agreed that you are likely not going to get a single closed genome with the data you have. If that is your ultimate goal then you may want to look at alternate sequencing technologies to supplement your Illumina data.

modified 4 weeks ago • written 4 weeks ago by genomax69k

With regard to the metrics for the downsampled and the entire dataset, which contigs can be used for downstream analyses?

I ask because the N50 for the downsampled dataset is 212,867, whereas for the entire dataset it is 3,574.

Is a higher N50 better, or a lower one?

written 4 weeks ago by bioinforesearchquestions240

https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50

N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.

A higher N50 is a better result.
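As an illustration of that definition, here is a minimal sketch that computes N50 from a list of contig or scaffold lengths. The function name and the toy lengths are made up for this example; QUAST reports the value for you, so this is only to show what the number means.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L hold at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that pushes the running sum to half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy lengths: total is 290, half is 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A fragmented assembly has many short sequences, so the running sum reaches the halfway point only at a small length. That is why the N50 of 3,574 for the entire dataset indicates a much more fragmented assembly than the 212,867 of the downsampled one.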

modified 29 days ago • written 29 days ago by genomax69k