InsideDNA: Benchmarking the seven most popular genome assemblers

Introduction: metrics for genome assembly comparison

Several factors influence the performance of de novo genome assemblers: read coverage, GC content, repeat fraction, etc. This report aims to elucidate the effect of read coverage on the performance of de novo assemblers; the other aspects will be covered in future InsideDNA reports. In this study, we benchmark seven popular assemblers: SPAdes, Velvet, SOAP2, ABySS, MaSuRCA, DISCOVAR, and Newbler. Two types of metrics are most often used to evaluate assembler performance: computational load (computing time, memory, and CPU usage) and quality statistics (e.g. N50, Feature Response Curve, assembly accuracy). Here, we concentrate only on quality statistics, choosing N50 and assembly accuracy as our evaluation metrics. N50 is defined as follows: given a set of contigs (or scaffolds) of varying lengths, the N50 is the length N such that contigs of length ≥ N contain at least 50% of all bases in the assembly. We used QUAST to calculate N50, taking each assembler's final output (contigs or scaffolds). Assembly accuracy, on the other hand, is a less trivial metric, and we describe it in more detail below.
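To make the definition concrete, a minimal shell sketch that computes N50 from a fasta of contigs could look as follows (illustrative only, not part of the benchmark pipeline; file names are placeholders):

# collect contig lengths, longest first
awk '/^>/{if(l) print l; l=0; next}{l+=length($0)} END{if(l) print l}' contigs.fasta | sort -rn > lens.txt
# total assembly length
total=$(awk '{s+=$1} END{print s}' lens.txt)
# N50 = contig length at which the cumulative sum first reaches half the total
awk -v t="$total" '{c+=$1; if (2*c >= t) {print "N50 = " $1; exit}}' lens.txt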

Methods and dataset simulation for genome assembly comparison

To run the assemblers, we chose six high-coverage WGS datasets (NA19238, NA12878, NA12892, NA19240, NA19239, and NA12891) from the 1000 Genomes Project. Prior to extracting reads from the WGS datasets, we ran Tandem Repeats Finder on the GRCh37 human genome to mask low-complexity regions and repeats.

idna_submit.py -t trf_masker -c 4 -r 3.6 -e '/srv/dna_tools/trf409/trf /data/userXXX/assembly_benchmark/human_g1k_v37.fa 2 7 7 80 10 50 500 -f -d -m'
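  • the numeric arguments (2 7 7 80 10 50 500) are TRF's recommended match, mismatch and indel weights, match and indel probabilities, minimum alignment score, and maximum period size; -m writes out a masked copy of the fasta, -d a data file, -f the flanking sequences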

Once low-complexity regions were masked, we manually selected 12 fragments from GRCh37, each at least 50 kb long and located on a different chromosome. These fragments were used as seeds for read extraction from the 1000 Genomes BAM files.

To extract raw reads from the six WGS datasets, we ran SAMtools, each time specifying the coordinates of one of the seed fragments.

idna_submit.py -t samtools_view -c 4 -r 3.6 -e 'idna_samtools_view -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/sample/sample.bam chrN:start-end > /data/userXXX/assembly_benchmark/reads.bam'
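  • -b emits BAM output; the chrN:start-end region (placeholder coordinates) restricts the output to reads overlapping the corresponding seed fragment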

This step extracted all raw reads mapped to a given seed fragment. Once we had obtained the raw reads for each seed fragment, we computed the average read depth. Depending on the fragment, coverage varied from 40× to 85×. We then homogenized read depth by subsampling every sample down to 40× coverage.

idna_submit.py -t samtools_view_sample -c 4 -r 3.6 -e 'idna_samtools_view -b -s {n}.{i} /data/userXXX/assembly_benchmark/reads.bam > /data/userXXX/assembly_benchmark/sampledreads.bam'

where {i} specifies the fraction of reads sampled and {n} the random seed for read selection; for example, -s 1.2 extracts 20% of the reads from the BAM file (with seed 1).
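For instance, under this scheme a fragment with an observed depth of 80× needs a sampling fraction of 40/80 = 0.5, so with (say) seed 7 the call would be (values illustrative):

idna_submit.py -t samtools_view_sample -c 4 -r 3.6 -e 'idna_samtools_view -b -s 7.5 /data/userXXX/assembly_benchmark/reads.bam > /data/userXXX/assembly_benchmark/sampledreads.bam'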

The procedure resulted in 12 sets of raw reads (two per WGS dataset), each "mappable" to one of the twelve seed fragments. All read samples contained paired-end reads with an average length of ~250 bp. To evaluate the effect of read coverage, we used the following strategy: from each raw read set we sampled fractions of reads in steps of 10%, yielding coverage depths from 4× to 40×. We replicated the sampling three times per dataset, so in total we generated 360 datasets (12 fragments × 10 coverage steps × 3 replicates) for assembly evaluation. These samples were given as input to the seven genome assemblers in FASTA/FASTQ format.
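As an illustration, such a sweep can be scripted as nested loops over replicates and sampling fractions (a sketch only: the 40× base file name and output paths are placeholders, and the unsampled 40× set supplies the tenth coverage point):

for rep in 1 2 3; do                     # three sampling replicates
  for pct in 1 2 3 4 5 6 7 8 9; do       # 10% ... 90% of the 40x base set
    idna_submit.py -t samtools_view_sample -c 4 -r 3.6 -e "idna_samtools_view -b -s ${rep}.${pct} /data/userXXX/assembly_benchmark/reads40x.bam > /data/userXXX/assembly_benchmark/reads_r${rep}_p${pct}0.bam"
  done
done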

Several additional parameters besides read coverage can influence assembler performance: the fragment sequence itself, the sequencing platform (the experimental setup in general), and the particular read set drawn by random sampling. Disentangling these three factors from the impact of read coverage is notoriously difficult, so we did not attempt to separate them. Instead, we used read sets from different samples and different genomic regions to maximize the variability of our data. We reasoned that if a difference in assembler performance is noticeable and consistent across different experimental setups, that is strong grounds to claim that assemblers genuinely differ at a given read coverage.

The following parameters (commands) were used for assembling:

SPAdes:

idna_submit.py -t spades_assembly -c 4 -r 3.6 -e 'idna_spades --careful --pe1-1 /data/userXXX/assembly_benchmark/end1.fq --pe1-2 /data/userXXX/assembly_benchmark/end2.fq -o /data/userXXX/assembly_benchmark/spades_outdir'
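  • --careful reduces the number of mismatches and short indels in the final assembly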
  • scaffolds are used for QUAST analysis

Velvet:

idna_submit.py -t velveth_assembly -c 4 -r 3.6 -e 'idna_velveth_{k-mer} /data/userXXX/assembly_benchmark/velvet_outdir {k-mer - 1} -fastq -shortPaired -separate /data/userXXX/assembly_benchmark/end1.fq /data/userXXX/assembly_benchmark/end2.fq'
idna_submit.py -t velvetg_assembly -c 4 -r 3.6 -e 'idna_velvetg_{k-mer} /data/userXXX/assembly_benchmark/velvet_outdir -cov_cutoff auto -ins_length {insert_length} -exp_cov auto'
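  • -ins_length gives the expected paired-end insert size; -exp_cov auto and -cov_cutoff auto let velvetg estimate the expected coverage and the coverage cutoff itself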
  • {k-mer} values are equal to: 32, 64, 96
  • contigs are used for QUAST analysis

SOAPdenovo2:

idna_submit.py -t soap_assembly -c 4 -r 3.6 -e 'idna_SOAPdenovo-{k-mer}mer all -s /data/userXXX/assembly_benchmark/sample.config -K {k-mer} -R -o /data/userXXX/assembly_benchmark/soap_outdir'
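  • -K sets the k-mer size; -R resolves repeats using read-path information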
  • {k-mer} values are equal to: 63, 127
  • sample.config - config file with input data description
  • scaffolds are used for QUAST analysis

DISCOVAR:

idna_submit.py -t discovar_assembly -c 4 -r 3.6 -e 'idna_DiscovarDeNovo READS=/data/userXXX/assembly_benchmark/input.bam OUT_DIR=/data/userXXX/assembly_benchmark/discovar_outdir'
  • scaffolds are used for QUAST analysis

ABySS:

idna_submit.py -t abyss_assembly -c 4 -r 3.6 -e 'idna_abyss-pe k={k-mer} name=name in="/data/userXXX/assembly_benchmark/end1.fq /data/userXXX/assembly_benchmark/end2.fq"'
  • {k-mer} values are equal to: 32, 64, 96
  • scaffolds are used for QUAST analysis

MaSuRCA:

idna_submit.py -t masurca_assembly -c 4 -r 3.6 -e 'idna_masurca /data/userXXX/assembly_benchmark/config.txt'
idna_submit.py -t masurca_assembly -c 4 -r 3.6 -e 'idna_masurca_assemble.sh'
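  • the first command generates the assemble.sh driver script from config.txt; the second command executes it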
  • config.txt – config file with input data description
  • scaffolds are used for QUAST analysis

Newbler:

idna_submit.py -t newbler_assembly -c 4 -r 3.6 -e 'idna_runAssembly -o /data/userXXX/assembly_benchmark/newbler_outdir -force -notrim -a 10 -het -ml 35 -sl 16 -ss 12 -p /data/userXXX/assembly_benchmark/allpairedreads.fastq'
  • allpairedreads.fastq – fastq file with both forward and reverse paired reads
  • scaffolds are used for QUAST analysis

All resulting scaffolds (or contigs) were given as input to QUAST, with the corresponding seed fragment serving as the reference sequence.

idna_submit.py -t quast_bench -c 4 -r 3.6 -e 'idna_quast -R /data/userXXX/assembly_benchmark/reference_frag.fa -e -o /data/userXXX/assembly_benchmark/quast_outdir /data/userXXX/assembly_benchmark/scaffolds.fasta'
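  • -R supplies the reference sequence (the corresponding seed fragment); -e runs QUAST in eukaryote mode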

QUAST output was then used to obtain the N50 value and to estimate assembly accuracy (the assembly error rate). When a reference sequence is provided, QUAST aligns each scaffold to it and finds all indels and mismatches; it also reports all unaligned scaffolds and misassembly events. As our measure of the error rate we used the sum of the total length of unaligned scaffolds, the total length of indels, and the number of mismatches, because this sum is proportional to the number of wrong bases in the scaffolds (proportional, but not equal, since some mismatches and indels originate from natural variation). Thus, for each read sample and each assembler run we obtained two values, N50 and error rate, which were plotted against coverage (for SOAP2, ABySS and Velvet we present the assembly with the best-performing k-mer size) (Fig. 1).
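For reference, the quantities entering this sum can be pulled from QUAST's tab-separated report with a one-liner such as the following (a sketch assuming QUAST's standard report.tsv metric names; the per-100-kbp rates still have to be rescaled by the aligned length to obtain absolute counts):

awk -F'\t' '$1 == "Unaligned length" || $1 ~ /mismatches per 100 kbp/ || $1 ~ /indels per 100 kbp/ {print $1 ": " $2}' /data/userXXX/assembly_benchmark/quast_outdir/report.tsv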

Figure 1. Dependence of assembly N50 on coverage for each of the seven assemblers (SPAdes, Velvet, SOAP2, ABySS, Discovar, MaSuRCA and Newbler). Coverage values are on the X axis; N50 values divided by the fragment length (50,000 bp) are on the Y axis. Each point represents the average N50 across all 36 (3 × 12) samples. Different colours correspond to different genome assemblers (see legend).

Results and discussion

Based on N50 values, we group the assemblers into two classes (Fig. 1). Five assemblers (SPAdes, Velvet, Discovar, MaSuRCA and Newbler) produce, on average, higher N50 values than the other two (SOAP2 and ABySS). This result is consistent across coverage values. SPAdes also appears to have the highest average N50 at low read coverage (< 16×).

We also examined how the assembly error rate depends on coverage (Fig. 2).

Figure 2. Dependence of the assembly error rate (Y axis) on coverage (X axis) for each of the seven assemblers (SPAdes, Velvet, SOAP2, ABySS, Discovar, MaSuRCA and Newbler). Each point represents the average assembly error rate across all 36 (3 × 12) samples.

As Figure 2 shows, SOAP2 has a substantially higher error rate than all other genome assemblers, whereas ABySS is quite accurate. Moreover, a paired Wilcoxon test showed that the difference in mean error rate between SOAP2 and every other assembler is significant at every read coverage value.
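A minimal sketch of that test, assuming the per-sample error rates for SOAP2 and one other assembler are stored in matching order in two plain-text files (file names hypothetical):

Rscript -e 'soap2 <- scan("err_soap2.txt"); other <- scan("err_other.txt"); print(wilcox.test(soap2, other, paired = TRUE))'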

In conclusion, we note that all assemblers (SPAdes, Velvet, SOAP2, ABySS, Discovar, MaSuRCA and Newbler) showed a similar trend of N50 growth with increasing read coverage, but the shape of the curve differed between assemblers. Based on curve shape, we conditionally divided the assemblers into two groups. The first group contained five assemblers (SPAdes, Velvet, Discovar, MaSuRCA and Newbler), all of which reached an N50 plateau relatively quickly, at a read coverage of ~16×. The second group contained two assemblers (SOAP2 and ABySS), both of which had worse N50 values than the first group across all coverage values. These two assemblers did not reach maximal performance on the given samples even at the highest coverage (~40×), and their N50 was approximately 70% of the seed fragment size.

Importantly, all three assemblers that require a k-mer size as a mandatory parameter (Velvet, SOAP2 and ABySS) had the worst average N50 values. For each of these three assemblers we tried several k-mer sizes (32, 64, 96, 127, etc.) and chose the one yielding the highest N50. The optimal k-mer value may differ from the ones chosen here, so in theory a different performance could be expected. Interestingly, SOAP2 showed the worst error rate among all assemblers, while ABySS, on the other hand, was quite accurate: ABySS produces short but correct scaffolds, whereas SOAP2 makes many mistakes (unaligned contigs, indels, etc.). The remaining assemblers converged to nearly the same limiting error-rate values at maximal read coverage. Surprisingly, the mean assembly error rate for SPAdes, Newbler and MaSuRCA was around 75 at the maximal read coverage (40×), which is very close to the ~50 expected from natural variation alone (0.1% of 50,000 bp). This means these three assemblers reached, or came very close to, the minimal possible error rate.

Based on our analysis, SPAdes, Velvet, Discovar, MaSuRCA and Newbler showed good results in both N50 (scaffold length distribution) and scaffold assembly accuracy. SOAP2 performed poorly on both metrics. ABySS had one of the lowest average N50 values at each coverage, but achieved good assembly quality.

Limitations of the study

There are, nevertheless, some limitations to our analysis. First, we used short fragment sequences (~50 kb) free of low-complexity regions and repeats. This eliminated other interfering parameters that could influence assembler performance, but made our test rather artificial and, in particular, quite distinct from the task of assembling eukaryotic genomes. Second, SOAP2 positions itself as a very fast assembler, so its poor results in our tests might be compensated by high speed (computational metrics will be covered in subsequent reports). Finally, ABySS was designed for short-read assembly, whereas our tests used only 250 bp reads; this factor could potentially bias the assembler's performance and lead to lower N50 values.
