
Question

Comparing the assemblers MaSuRCA and SPAdes for assembling unmapped reads

6 weeks ago

Sony ▴ 10

Hi everyone,

I have paired-end whole-genome sequencing reads from Brassica. I mapped the short reads to the reference genome and extracted the unmapped reads. Then I assembled these unmapped reads de novo into contigs with two assemblers, MaSuRCA and SPAdes. The QUAST stats for both assemblies are given below.
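For the extraction step, SAM/BAM records mark unmapped reads in the FLAG field: bit 0x4 means the read itself is unmapped and bit 0x8 means its mate is unmapped. The post does not say which criterion was used, so the following is only a minimal sketch, assuming one wants pairs in which both mates are unmapped (the same condition `samtools view -f 12` selects). The function name `both_mates_unmapped` is hypothetical.

```python
# SAM FLAG bits, as defined in the SAM format specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def both_mates_unmapped(flag: int) -> bool:
    """Return True if a record belongs to a pair where neither mate mapped.

    Assumption: secondary and supplementary records are excluded so that
    each read is counted once.
    """
    wanted = READ_UNMAPPED | MATE_UNMAPPED
    excluded = SECONDARY | SUPPLEMENTARY
    return (flag & wanted) == wanted and (flag & excluded) == 0


# Example FLAG values: 77 and 141 are the two mates of a fully unmapped pair,
# 73 is a mapped read whose mate is unmapped, 69 is an unmapped read whose
# mate is mapped.
for flag in (77, 141, 73, 69):
    print(flag, both_mates_unmapped(flag))
```

Whether reads with a mapped mate (FLAG 69 above) should also go into the assembly is a design choice: including them gives more data but may pull in sequence that borders the reference.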

Assembly stats for the MaSuRCA assembly:

Assembly primary.genome.scf

# contigs (>= 0 bp)         6180
# contigs (>= 1000 bp)      1284
# contigs (>= 5000 bp)      8
# contigs (>= 10000 bp)     0
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      4715175
Total length (>= 1000 bp)   2119701
Total length (>= 5000 bp)   47692
Total length (>= 10000 bp)  0
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   3546
Largest contig              6948
Total length                3703670
GC (%)                      39.04
N50                         1110
N90                         600
auN                         1449.9
L50                         1030
L90                         2868
# N's per 100 kbp           0.00
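For reference, the N50, N90, L50, L90 and auN rows can be reproduced from a list of contig lengths. The following is a minimal sketch using the standard definitions; the function names `nx_lx` and `aun` and the toy lengths are made up for illustration, and this is not QUAST's own code.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running total of the
    longest contigs first reaches `percent` % of the assembly length;
    Lx is how many contigs were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point edge cases.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("no contig lengths given")


def aun(lengths):
    """Area under the Nx curve: sum of squared lengths over total length."""
    return sum(length * length for length in lengths) / sum(lengths)


# Toy example, not taken from the assemblies in this post.
toy = [80, 70, 50, 40, 30, 20]
n50, l50 = nx_lx(toy, 50)
n90, l90 = nx_lx(toy, 90)
print(n50, l50, n90, l90, round(aun(toy), 1))
```

As far as I know, QUAST by default computes the rows without an explicit threshold (`# contigs`, `Total length`, N50 and so on) only on contigs of at least 500 bp, which would explain why `# contigs` is smaller than `# contigs (>= 0 bp)`.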

Assembly stats for the SPAdes assembly:

Assembly contigs

# contigs (>= 0 bp)         52187
# contigs (>= 1000 bp)      2881
# contigs (>= 5000 bp)      47
# contigs (>= 10000 bp)     1
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      20642697
Total length (>= 1000 bp)   5141662
Total length (>= 5000 bp)   287508
Total length (>= 10000 bp)  12949
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   8583
Largest contig              12949
Total length                9033035
GC (%)                      37.17
N50                         1133
N90                         578
auN                         1612.0
L50                         2293
L90                         6900
# N's per 100 kbp           0.00
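One number that can be read directly off the two tables is the share of assembled bases that sits in contigs of at least 1000 bp. The following is a small sketch of that arithmetic using only values copied from the QUAST output above; the dictionary layout and the function name `percent_in_long_contigs` are made up for illustration.

```python
# Values copied from the QUAST tables above.
quast = {
    "MaSuRCA": {"total_all": 4715175, "total_ge_1kb": 2119701},
    "SPAdes": {"total_all": 20642697, "total_ge_1kb": 5141662},
}


def percent_in_long_contigs(stats):
    """Percentage of all assembled bases found in contigs >= 1000 bp."""
    return 100.0 * stats["total_ge_1kb"] / stats["total_all"]


share = {name: percent_in_long_contigs(s) for name, s in quast.items()}
for name, pct in share.items():
    print(f"{name}: {pct:.1f} % of assembled bases are in contigs >= 1000 bp")
```

This comes out at roughly 45 % for MaSuRCA and roughly 25 % for SPAdes, so a larger part of the SPAdes output consists of very short contigs. That is only one view of the data and does not by itself say which assembly is better.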

Then I screened both assemblies for contamination with the NCBI Foreign Contamination Screen (FCS-GX) and removed it. I got two output files: clean.fasta (the assembled sequences after removing contamination) and contamination.fasta (the contaminant sequences). Here are the sequence stats for the raw assembly, the clean assembly, and the contaminant sequences:

[image not available]

Based on these stats, SPAdes generated more contaminated contigs than MaSuRCA.

I also ran RepeatModeler on each clean.fasta to detect repetitive sequences. It found 28 repeat sequences in the MaSuRCA clean.fasta and 122 in the SPAdes clean.fasta.

The RepeatModeler masked stats for MaSuRCA clean.fasta sequence:

Sample Stats:

   Sample Size 3964084 bp

   Num Contigs Represented = 4900
   Non ambiguous bp:
         Initial: 3964084 bp
         After Masking: 3870473 bp
         Masked: 2.36 %

-- Input Database Coverage: 3964084 bp out of 3964159 bp ( 100.00 % )

The RepeatModeler masked stats for SPAdes clean.fasta sequence:

Sample Stats:

   Sample Size 10000678 bp

   Num Contigs Represented = 21926
   Non ambiguous bp:
         Initial: 10000678 bp
         After Masking: 9681079 bp
         Masked: 3.20 %
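The "Masked" percentages follow from the "Initial" and "After Masking" base counts. The following is a short check of that arithmetic using the numbers reported above; the function name `masked_percent` is made up for illustration.

```python
def masked_percent(initial_bp, after_masking_bp):
    """Percentage of non-ambiguous bases that were masked as repeats."""
    return 100.0 * (initial_bp - after_masking_bp) / initial_bp


# Base counts copied from the two masking reports above.
masurca = masked_percent(3964084, 3870473)
spades = masked_percent(10000678, 9681079)

print(f"MaSuRCA: {3964084 - 3870473} bp masked, {masurca:.2f} %")
print(f"SPAdes:  {10000678 - 9681079} bp masked, {spades:.2f} %")
```

Both values agree with the reported 2.36 % and 3.20 %, so the two clean assemblies differ by less than one percentage point in repeat content even though the SPAdes one is about 2.5 times larger.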

Based on this information, do you have any suggestions on which assembler is better in this case (that is, which assembled sequence is better)? Thank you.

SPAdes unmapped_reads. assembly. MaSuRCA. • 140 views
