Forum:Benchmark - Comparison of Different NGS Mappers
7.8 years ago

In order to compare read aligners, we used a published real-life paired-end DNA/RNA-Seq dataset. All optimal alignments (including multiple mapping loci) of 100,000 read pairs per sample were obtained with RazerS 3 (a full-sensitivity mapping tool). In the benchmark, we measured how well different NGS mappers, run with default parameters, recover all optimal hits. Reads with up to 10 multiple mapping loci, allowing up to 10 errors (mismatches and indels), serve as 'true positives' in this benchmark.
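To make the selection rule concrete, here is a minimal sketch of how reads could be filtered by those two constraints. This is purely illustrative (the tuple layout and function name are made up, not the benchmark's actual code); the input would be the full-sensitivity alignment list from a tool like RazerS 3.

```python
from collections import defaultdict

def select_benchmark_reads(alignments, max_loci=10, max_errors=10):
    """Keep only reads whose full set of optimal alignments satisfies the
    benchmark constraints: at most `max_loci` mapping loci, each locus with
    at most `max_errors` errors (mismatches + indels).

    `alignments` is an iterable of (read_id, locus, n_errors) tuples,
    e.g. parsed from the output of a full-sensitivity mapper.
    """
    hits = defaultdict(list)
    for read_id, locus, n_errors in alignments:
        hits[read_id].append((locus, n_errors))
    selected = {}
    for read_id, loci in hits.items():
        if len(loci) <= max_loci and all(e <= max_errors for _, e in loci):
            selected[read_id] = loci
    return selected
```

A mapper's reported hits can then be compared against each read's retained loci.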

mRNA-Seq benchmark NGS-mapper

Just a note: cutting the bar-plot axis is not a very good solution, since it makes it look like BWA-MEM is 2 times worse than STAR, while the difference is actually quite small :)


Well, if you don't cut it, no difference is visible at all. Most people here are researchers and they know how to read these plots. The percentages are also written above the bars... I don't see any potential cheating here. :)


Most researchers also use breaks in their bars to address this issue...


For my own curiosity, would one expect much if any variance in performance with these tools? Say if you were to repeat this benchmark but use 100 runs of 100k read pairs randomly sampled from some larger pool of data. It'd be interesting to see ROC plots for these using data from many runs.

p.s. on your website the wall time plots do not have a unit on the ordinate.
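The re-sampling experiment described above could be sketched roughly like this. Everything here is hypothetical: `is_correct` stands in for whatever correctness criterion the benchmark applies to each read pair, and the pool is just a list of read identifiers.

```python
import random

def sensitivity_replicates(read_pool, is_correct, n_runs=100,
                           sample_size=100_000, seed=0):
    """Estimate the spread of a mapper's sensitivity by repeatedly
    down-sampling read pairs from a larger pool and recomputing the
    fraction mapped correctly.  `is_correct` maps read id -> bool
    (hypothetical: e.g. 'found one of the optimal alignments').
    Returns (mean sensitivity, sample variance across runs)."""
    rng = random.Random(seed)
    sens = []
    for _ in range(n_runs):
        sample = rng.sample(read_pool, min(sample_size, len(read_pool)))
        sens.append(sum(is_correct[r] for r in sample) / len(sample))
    mean = sum(sens) / len(sens)
    var = sum((s - mean) ** 2 for s in sens) / (len(sens) - 1)
    return mean, var
```

With real data, the per-run sensitivities would also give the points for the ROC-style plots mentioned above.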


Good point @joe.cornish826. The OP used a data presentation style known to be poor practice in order to show barely perceptible differences in data that is reported as a single number taken from a process with a random component. They have no way of knowing whether the differences are a complete fluke. In summary, they violated two principles of good data presentation: 1) cutting off axes and 2) presenting random data as a deterministic value. The comment that "most people here are researchers who know how to read these plots" is extremely disheartening.


ariel.balter, without cutting off the axes, nobody could see any differences in the bars, which would make the plot completely unreadable. My principle of good data presentation includes good visibility of the message I want to communicate. The message in this plot is: "All mapping algorithms are above 95% and the differences are minor." And randomly picking 100k reads for this analysis is statistically absolutely valid for this simple comparison.

We made this plot to give people a feeling for the different mapping algorithms and to show them how they could evaluate mapping algorithms. We do not claim that it is perfect. Actually, we were and are absolutely open to discussion about how to make this benchmark better.


Hi there,

I was just wondering on what basis you have decided that these are false positive hits.

I really appreciate your response, thanks once again.

best regards,
kiran.


A false positive hit means that the mapping algorithm found a position for the read, but RazerS showed that there is a better mapping position in the genome.

7.8 years ago
Asaf 9.3k

1. I was always puzzled about how one determines whether an alignment is correct and what is considered a false negative, false positive, or true positive (is a true negative an option? an unmappable read?). Why go as far as 10 mismatches? Why do you need all of them? I personally think that the aligner should be tested using downstream analysis such as differential expression, SNP detection, etc.
2. Are you comparing the algorithms or their default parameters? For instance, bowtie2 and BWA will do pretty much the same (on DNA) using the same set of parameters (allowed mismatches, seed length).
3. What do we care about? Do we care about a few (hundred?) reads that weren't mapped? Do we care about run time and memory? Reads that shouldn't be mapped but do? I guess it depends on the user.
4. Comparing bowtie2 with STAR on RNA-seq? The least you can do is give TopHat a try.

I think it's pretty clear that the test set used is from a segemehl publication, since we cited it several times. Nevertheless, I think the benchmark, as done by the segemehl developers, is quite good.

But you are right, there are several points open for discussion.

1. Of course one can use fewer than 10 errors and no multiple mapping loci for benchmarking. These values were simply set for this benchmark, because one has to make a decision. But as we stated on the website, if one changes these numbers, the results will look different. We think that multiple mappings are important and should not be discarded, but for some experiments one might not need them. For DE it is good to use only uniquely mapped reads, but for transcriptome reconstruction multiple mappings might be useful. It is always a question of what one wants to do.
2. All the exact parameters can be looked up in the publication (the link is on the website). For our test we used the default parameter set, but other parameters were also tested in the publication. And you are absolutely right: if one changes the parameters, the results will change accordingly.
3. Absolutely correct! That's why we also plotted the memory consumption and the user time. This way people can take a look at the benchmark and make their own decision.
4. We are open to add other tools. Tophat is on our ToDo-list and will be added.

We just added this benchmark to our website, since a lot of people keep asking us which tool they should use and there is no simple answer to that. I think the way the segemehl people created their test sets is nice and a good starting point. But as you already mentioned, it is from the segemehl publication. We do not give any recommendations here. As you nicely pointed out, the memory consumption and the user time of segemehl are higher, and thus it cannot be used by someone without access to an HPC system. For those people, Bowtie2 or BWA is much better, and both have very good sensitivity.

The questions one should ask oneself before using any mapping tool are: What do I have (main memory, size of dataset, RNA-/DNA-Seq, etc.), and what is my goal (differential expression, transcriptome reconstruction, variant calling, etc.)? The benchmark list can then give a hint about which tool might be useful. But it is far from complete.

My tip: First think about your data and your computer environment and then try to choose a reasonable mapping tool.


4. We are open to add other tools. Tophat is on our ToDo-list and will be added.

Any chance of including BBMap?


I will add it to our list of ToDos...


Dear Brian,

We're just about to add BBMap to the benchmark, but there seems to be a bug in your tool.

I tried to map the following two reads to the human genome (hg19):

mate1:

@artificial_illumina_paired_len100.fastq.000003910
AAATTATTTTTTGTTTGTTTGTTTGAGACAGGGTTTCACTCTGTCACCCAGGCTGGGGTGCAGTGTGATCTCAGCTCACTGCAACCTCTGGCTCCTGGGT
+
HIIIHIIHHIIHGHHIGHFIIIGIFFIIHIGIHEIIGIGFFEIGIEBIHICGFHHIIHDAICAIHBIHC?@EEIIGFEE@IIFIIIIIFII<GIAEICII


mate2:

@artificial_illumina_paired_len100.fastq.000003910
ACGCCTGTAATTCCAGCACTTTGGGAGGCTGAGGTGGGTGGATCACTTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATAGCAAAACCCCATCTCTA
+
HIHIIIIHHHIHGIHIIIHHHIGGIIIHHGIIFIHHFIIIGHHGFFIGIFEIIHICGIIIIGGHIIEIDHFHEEIIHIEIGI=IIFAIIIICFIIEE;IE


The SAM output looks like this:

...
...
@SQ     SN:chrY LN:59373566
@PG     ID:BBMap        PN:BBMap        VN:33.21        CL:java -ea -Xmx41812m BBMap build=1 overwrite=true fastareadlen=500 qin=33 ref=./genome/hg19.fa path=./genome_indices/hg19_bbmap out=BBMap_bug.sam in=BBMap_bug_1.fq in2=BBMap_bug_2.fq
artificial_illumina_paired_len100.fastq.000003910       97      chr8    30133366        40      62M3D38M        =       77253035        -73253232       AAATTATTTTTTGTTTGTTTGTTTGAGACAGGGTTTCACTCTGTCACCCAGGCTGGGGTGCAGTGTGATCTCAGCTCACTGCAACCTCTGGCTCCTGGGT    HIIIHIIHHIIHGHHIGHFIIIGIFFIIHIGIHEIIGIGFFEIGIEBIHICGFHHIIHDAICAIHBIHC?@EEIIGFEE@IIFIIIIIFII<GIAEICII    NM:i:3  AM:i:40
artificial_illumina_paired_len100.fastq.000003910       145     chr16   77253035        37      100M    =       30133366        73253232        TAGAGATGGGGTTTTGCTATGTTGGCCAGGCTGGTCTTGAACTCCTGACCTCAAGTGATCCACCCACCTCAGCCTCCCAAAGTGCTGGAATTACAGGCGT    EI;EEIIFCIIIIAFII=IGIEIHIIEEHFHDIEIIHGGIIIIGCIHIIEFIGIFFGHHGIIIFHHIFIIGHHIIIGGIHHHIIIHIGHIHHHIIIIHIH    NM:i:4  AM:i:37


One mate maps to chr8 and the other mate to chr16. In column 7 of the SAM output, BBMap writes an '=' sign, which says that the other mate is on the same chromosome, which is not the case.

I just wanted to inform you about that problem. We wrote a little script to correct your files before starting the benchmark, since we need this information. I'll let you know once the benchmark is finished.
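For reference, the essence of such a correction script might look like the sketch below. This is not the exact script we used; it assumes name-sorted input with exactly two records per read pair, and it only repairs RNEXT (column 7) when '=' is written although the mates map to different chromosomes.

```python
def fix_rnext(sam_lines):
    """Repair SAM records where RNEXT (column 7) is '=' although the two
    mates map to different chromosomes.  Header lines ('@...') are passed
    through unchanged; records are paired by QNAME."""
    out, pending = [], {}
    for line in sam_lines:
        if line.startswith("@"):
            out.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        qname = fields[0]
        if qname in pending:
            mate = pending.pop(qname)
            # point each record's RNEXT at the other mate's RNAME
            for rec, other in ((mate, fields), (fields, mate)):
                if rec[6] == "=" and rec[2] != other[2]:
                    rec[6] = other[2]
            out.append("\t".join(mate))
            out.append("\t".join(fields))
        else:
            pending[qname] = fields
    # unpaired leftovers are emitted unchanged
    out.extend("\t".join(f) for f in pending.values())
    return out
```

Pairs that really are on the same chromosome keep their '=' untouched.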


Hi David,

I appreciate you looking into adding BBMap to the test. I noticed from the "@PG" line that you're using version 33.21. There was a bug for a while that in some cases would incorrectly put an "=" sign for rnext, which I never noticed in testing because I don't use the rnext field for anything. It's been fixed since v34.28, so the current version (34.41) works fine. But thanks for reporting it!

-Brian


Ah ok, I installed it a while ago and didn't check whether it was the latest version. Good that we caught that before the benchmark. :)


Any news on the results of BBMap?


We are working on it. Should be online soon.


Hi David, were BBMap results eventually included? If so, where can we find them? Thanks in advance


Point 3: That's why I like to have benchmarks. A researcher can actually take your questions, go to the benchmark page, and see how the different tools behave. What's your remark here?

7.8 years ago
lh3 33k

Edit distance is favored by computer scientists as it is well defined, but it is not always a good standard for biological data. A true hit with an 11bp indel will be considered false if we only allow 10 differences. Even if a true hit has a 10bp indel, the best hit under edit-distance scoring may be different. Taking RazerS, an edit-distance-based mapper, as the ground truth is flawed to some extent. Also, RazerS is not aware of splicing if I am right, so how is it used to evaluate RNA-seq mappers such as star and segemehl?

More generally, most mapper benchmarks are biased by the views of their designers or the mapper developers. When I want to know the relative performance of two mappers, I tend to read multiple papers that evaluate both but are not written by the developers of either. For example, a paper describing a new mapper "A" evaluates older mappers B, C, D and E; a paper describing B evaluates C, E and F. Then we have two relatively unbiased observations of C and E. I see the ensemble of benchmarks as a better benchmark than any individual one, including any benchmark I could make myself. segemehl is an old mapper but is rarely evaluated by others, which makes it hard for me to understand where it stands.

As Asaf pointed out, mapping is only an intermediate step. These days, I would also rather see how mapping affects downstream analyses such as variant calling, expression, etc. A complication is that a downstream tool is designed with specific mappers in mind. For example, GATK usually performs better on bwa alignments than on bowtie2 alignments, although bowtie2 and bwa are similar in accuracy by some other standards. For another example, a preprint (I forget which) claims that cufflinks works better with tophat2 mapping, although star is shown to be more accurate by the same authors. Nonetheless, for variant calling on RNA-seq, you will find that the GATK team recommends star over tophat2. Sometimes we are not choosing the best tool, but the chain of tools that works best together.


I completely understand your first argument and will double-check that question with the guys who created this benchmark and wrote the scripts.

Your second argument is also correct, and I have to highlight here that we do not want to give any recommendations. segemehl performs best, but the difference in sensitivity is very small, and there is the drawback of its memory consumption. This benchmark is just one possible way to get a feeling for mappers. I also know your benchmarks, and they helped me a lot. This one is just another, with a different point of view. And I'm pretty sure that hundreds will follow. In the end, the user has to decide which tool he or she wants to use.

The last point is a very good one, and I really think it would make sense to think about such a test. I'm not sure how one would/should design it, since, as you said, there are strong connections between some mapping tools and downstream tools. But if Asaf has an idea, I'm open to any discussion of a benchmark using downstream analyses.


On the last point, a solution is to pair each mapper with each mainstream caller/quantifier. You will end up with a m-caller-by-n-mapper matrix. Lots of computation. I did this for bowtie2 and bwa-mem on variant calling in my recent Bioinfo review.


Ok, I'll take a look. But I think comparing all mappers with all callers/quantifiers and all possible parameter settings is not an option for us. At the end of the day, such a benchmark is probably too time-consuming for us.


You don't need to try different settings. Just use one setting recommended by the developers. In the real world, few have the time and enough knowledge to tune parameters.


Following up on your question about RazerS and splicing: split-reads were not considered in this benchmark! The idea of using an mRNA-Seq dataset for this benchmark was the higher number of errors in mRNA-Seq reads (even without split-reads). Post-transcriptional modifications, amplification errors, etc. result in reads that are harder to map.

From the supplement of the paper: 'For both Illumina mRNA-seq datasets, the post-processing involved removing reads that possibly overlapped exon-exon junctions. To achieve this, the entire dataset was mapped using segemehl (with -S option), STAR [3], TopHat2 [4], and SOAPsplice [5], and reads which were split-mapped by any of these tools were removed prior to down-sampling. In case of the paired-end mRNA-seq dataset, only paired-end reads were kept where both ends were not split-mapped by any of the tools.'

Split-read mappers were analyzed in another publication:

Hoffmann S, Otto C, Doose G, Tanzer A, Langenberger D, Christ S, Kunz M, Holdt L, Teupser D, Hackermüeller J, Stadler PF. A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection. Genome Biol. 2014 Feb 10;15(2):R34.

Figure: Performance of various read aligners on simulated data sets with different splice events. For simulated 454 reads (400 bp), segemehl performed significantly better in detecting conventional and 'non-conventional' (strand-reversing, long-range) splice junctions. segemehl was the only tool that consistently recalled more than 90% of conventional splice junctions. For 'non-conventional' splice events, segemehl extended its lead to 40% for recall without losing precision. Likewise, compared to three of the seven alternative tools, segemehl had a 30% increase in recall for irregularly spliced Illumina reads (100 bp). Compared to TopHat2, it had a slight increase while reporting significantly fewer false positives. At the same time, segemehl's performance with simulated, regularly spliced Illumina reads was comparable with the other seven tools tested. gs, GSNAP; ms, MapSplice; ru, RUM; se, segemehl; sm, SpliceMap; so, SOAPsplice; st, STAR; to, TopHat2.

7.7 years ago

We don't want to claim any universal superiority or inferiority of any tool!

This benchmark is just one possible way of comparing different NGS mappers. We think that it might be interesting for researchers who are new in the field. In-house, we use all four mapping tools, depending on the task!

Feel absolutely free to join the discussion here! Perhaps we can design a completely new and much better test together!


Rather than using RazerS 3, or any other tool for evaluating mappings that is unaware of the actual truth, I recommend using reads tagged by their genomic origins. Any other approach is biased, and inherently incorrect. For example, consider a 100bp read with 100 substitutions that coincidentally make it perfectly align to somewhere else in a genome. Using an edit-distance based aligner to determine the correct mapping, you will obviously get the wrong answer. Nothing can get the RIGHT answer, of course, since all initial information has been lost; and as such, awarding any points to any algorithm in that case demonstrates faulty grading. Similarly, any grading metric based on prediction rather than truth - even if some information content is retained in the data, unlike my example - cannot yield a fair answer, due to bias.

An unbiased approach:

Generate synthetic reads from random locations in the reference genome. Randomly mutate each read in specified ways, retaining the original genomic position of each base, and ideally, reflect error-mode mutations in quality scores. Grade mapping programs on whether they correctly align read bases to their original reference bases.
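A minimal sketch of this grading-by-origin idea (substitutions only, for brevity; a real simulator would also add indels and error-model-driven quality scores, and the function names here are illustrative, not any particular tool's API):

```python
import random

def simulate_read(genome, read_len=100, n_subs=2, seed=0):
    """Draw a read from a random genomic position, record where each base
    came from, then introduce `n_subs` random substitutions.
    `genome` is a plain string over ACGT."""
    rng = random.Random(seed)
    start = rng.randrange(len(genome) - read_len)
    bases = list(genome[start:start + read_len])
    origin = list(range(start, start + read_len))  # per-base ground truth
    for i in rng.sample(range(read_len), n_subs):
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases), origin

def grade_alignment(origin, aligned_positions):
    """Fraction of read bases a mapper placed at their true reference
    position; `aligned_positions[i]` is the reference coordinate the
    mapper assigned to read base i (None if unaligned/clipped)."""
    hits = sum(1 for o, a in zip(origin, aligned_positions) if o == a)
    return hits / len(origin)
```

The point of the per-base `origin` list is exactly the one made above: correctness is judged against the known source coordinates, not against any similarity score.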

Any other approach is inferior in determining absolute mapping accuracy (though as lh3 mentions, mapping accuracy and the quality of end results are not equivalent). Specifically, an edit-distance-based approach, particularly with RNA-seq data (as in this test), will always give incorrect results for reads spanning introns sufficiently greater than read length; thus, grading such reads based on edit distance will consider an edit-distance-based aligner to be correct and a splice-aware aligner to be wrong, when the opposite is true. Edit distance is a useful heuristic for guessing at truth, but should never be used to quantify correctness; and particularly, should never be presented as truth, because it is not.


Assume one of your synthetic molecules of length 100 nt is randomly picked from the genome. Then you create a synthetic read by randomly mutating it. What if the inserted mutations result in a read that maps to a completely different position in the genome than the molecule does? The mapper obviously finds the correct position of the read (which is not the correct position of the molecule). In your set-up this mapping would be a false positive, but actually the mapper placed the read correctly; the optimal alignment of the read is just different from the molecule's.

I also have a problem with "randomly mutating" randomly selected synthetic molecules. In real life the molecules are not randomly selected; there are always biases. And random mutations that 'ideally reflect error-mode mutations in quality scores' are hard to create. For me, these are too many synthetically designed assumptions.

What I like about this test set: real-life biases in picking (only the down-sampling to a 100k-read test set might introduce a bias; this should be tested) and real-life mutations, since it is a real-life dataset.

Finding the optimal alignments of the reads (using RazerS 3) ensures that we get a set of mapping locations that can/should be found by the mapping tools (which do not have to be the locations of origin).


Assume one of your synthetic molecules of length 100 nt is randomly picked from the genome. Then you create a synthetic read by randomly mutating it. What if the inserted mutations result in a read that maps to a completely different position in the genome than the molecule does? The mapper obviously finds the correct position of the read (which is not the correct position of the molecule). In your set-up this mapping would be a false positive, but actually the mapper placed the read correctly; the optimal alignment of the read is just different from the molecule's.

Say a read originates from location A, and due to mutations better matches location B using some metric (such as edit distance). If a mapper places the read at location B, it is wrong. When doing something like calling variations with respect to a reference, or calculating coverage, or in fact any analysis, you will draw false conclusions. No evaluation should award points to a program that yields the wrong answer merely on the basis that it is congruent with some arbitrary metric of similarity.

Neither RazerS 3 nor any other algorithm finds "optimal alignments" that "should be found by the mapping tools". RazerS 3 may yield alignments that maximize scores according to some arbitrary scoring function, but these alignments do not optimize some other scoring function which may better reflect evolutionary mutation modes and pressures. If RazerS 3 maximizes scores based on edit distance, then it specifically does NOT maximize scores based on affine-transform-weighted scoring functions, substitution-weight-matrix functions, complexity/entropy-based functions, or indeed any other function - many of which are generally superior to edit distance, which is only commonly used because it is computationally inexpensive, easy to implement, and easy to interpret, not because it has any particular biological significance (as do the other scoring functions mentioned).

Of course a program that operates by maximizing some arbitrary scoring function F will outperform programs that maximize different scoring functions when their results are evaluated with scoring function F! That's like grading tests in a class based on how close the answers are to those of your favorite student. Obviously your favorite student will get a perfect score, and everyone else will be penalized for questions they answer correctly whenever the favorite student gets them wrong.


This comment reminds me of another question that is not obvious from the benchmark page - are you running mappers in the paired-end mode? Suppose for two reads in a pair, read1 has a single perfect match at chr1:+1000 with no 1/2/3/4-mismatch hits; read2 has a single perfect match at chr2:+2000 and a 1-mismatch hit at chr1:-1200. The 1-mismatch hit of read2 is very likely to be correct although it is not the best hit. Will you consider this 1-mismatch hit correct?

Also, the page says "true positives are reads with up to 10 multiple mapping loci, allowing up to 10 errors (mismatches and indels)". Do you really mean "true positives" here? Or simply "positives", or "reads used in this benchmark"? What is the exact definition of "on-target hits" and "false positive hits" especially when a read has multiple equally best hits? Do you consider the pairing constraint?


For the paired-end mappings, only paired-end alignments with both ends being optimal alignments (minimal edit distance) and satisfying additional constraints (same chromosome, correct order, insert size between 250 and 500) were considered 'positive'. This way it is assured that no mapper is favored... nevertheless, the results might be too optimistic. And of course, all mappers were run in paired-end mode for this benchmark.

To your second question: all input reads were mapped using RazerS 3, and only the reads resulting in 'positive' mappings (up to 10 errors and 10 multiple loci) were used for the benchmark... the others are ignored. So I would say that "reads used in this benchmark" fits best. I defined these constraints as "true positives" before, which might be misleading, since biologists and informaticians have different definitions when talking about "true positives". Sorry for the confusion.
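For illustration, the pairing constraints could be checked like this. The record layout is made up (a dict per mate with 0-based coordinates), and "correct order" is interpreted here as the usual forward-reverse orientation; neither detail is taken from the actual benchmark code.

```python
def is_positive_pair(m1, m2, min_insert=250, max_insert=500):
    """Check the pairing constraints described above.  Each mate is a dict
    with 'chrom', 'pos' (leftmost aligned base, 0-based), 'end' (one past
    the last aligned base), 'strand' ('+'/'-'), and 'optimal' (True if the
    alignment has minimal edit distance).  A pair is 'positive' only if
    both ends are optimal, lie on the same chromosome, face each other in
    forward-reverse order, and the insert size is within bounds."""
    if not (m1["optimal"] and m2["optimal"]):
        return False
    if m1["chrom"] != m2["chrom"]:
        return False
    left, right = sorted((m1, m2), key=lambda m: m["pos"])
    # forward-reverse orientation: leftmost mate on '+', rightmost on '-'
    if left["strand"] != "+" or right["strand"] != "-":
        return False
    insert = right["end"] - left["pos"]
    return min_insert <= insert <= max_insert
```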

7.8 years ago

I have several questions regarding the interpretation of the segemehl benchmark results.

First, having looked at Table S3 and seen that segemehl requires 70GB of RAM to align to a mammalian genome, I'm wondering whether this extra memory is really worth the 3-5% increase in sensitivity.

The running time also suggests that it would take ~100 hours to map all reads from a HiSeq lane (given there is no redundancy-based optimization in segemehl), which is quite a lot.

The last thing is that 100,000 reads seems like quite a small subset, so several re-samples followed by RazerS 3 mapping would be nice, to see the variance in the precision/recall estimates.


Well, question one you have to answer for yourself. Using multithreading ensures that you don't have to wait 100 hours. And the last question... well, try it! :)


Ok, I see. I just failed to find any specification of the benchmark hardware in either the paper or the supplementary material, so I assumed the benchmark timings come from several cores, not a single one. And what would the memory requirement per CPU be then?


Well, I think most of the time shown in the plot is spent loading the index into memory. That has to be done only once, resulting in only a small increase in time when mapping more reads. You can run segemehl on your machine and check the time used by typing "time" directly in front of your call. But you make a good point here; perhaps we should separate the index-loading step from the mapping and show them in two different plots.