Edit distance is favored by computer scientists because it is well defined, but it is not always a good standard for biological data. A true hit with an 11bp indel will be considered false if we only allow 10 differences. Even if a true hit has a 10bp indel, the best hit under edit-distance scoring may be a different one. Taking RazorS, an edit-distance-based mapper, as the ground truth is therefore flawed to some extent. Also, if I am right, RazorS is not aware of splicing, so how is it used to evaluate RNA-seq mappers such as star and Segemehl?
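To make the 11bp example concrete, here is a minimal sketch in Python. It assumes a plain global Levenshtein distance between a read and its true reference window, which is a simplification of what a real mapper does; the sequence, the deletion position and the name `MAX_DIFFS` are made up for illustration.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Hypothetical 100bp reference window (random sequence, fixed seed).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(100))

# A "true" read that differs from the reference only by an 11bp deletion.
read = reference[:40] + reference[51:]

MAX_DIFFS = 10  # assumed threshold: at most 10 differences allowed
d = edit_distance(read, reference)
print(d, "accepted" if d <= MAX_DIFFS else "rejected")
```

The two strings differ in length by 11, so the distance cannot be lower than 11, and the correct placement is rejected under a threshold of 10 even though nothing else about the read is wrong.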
More generally, most mapper benchmarks are biased by the views of the benchmark designers or the mapper developers. When I want to know the relative performance of two mappers, I tend to read multiple papers that evaluate the two but are not written by the developers of either. For example, a paper describing a new mapper "A" evaluates older mappers B, C, D and E; a paper describing B evaluates C, E and F. Then we have two relatively unbiased observations of C and E. I see the ensemble of benchmarks as a better benchmark than any individual one, and better than any benchmark I could make myself. Segemehl is an old mapper but is rarely evaluated by others, which makes it hard for me to understand where it stands.
As Asaf pointed out, mapping is only an intermediate step. These days, I also prefer to see how mapping affects downstream analyses such as variant calling, expression, etc. A complication is that a downstream tool is often designed with some specific mappers in mind. For example, GATK usually performs better on bwa alignments than on bowtie2 alignments, although bowtie2 and bwa are similar in accuracy by some other standards. For another example, a preprint (I forget which) claims that cufflinks works better with tophat2 mapping although star is shown to be more accurate by the same authors. Nonetheless, for RNA-seq variant calling, the GATK team recommends star over tophat2. Sometimes we are not choosing the best tool, but the chain of tools that work best together.
Just a note: cutting the bar plot axis is not a very good solution. It makes it look like BWA-MEM is 2 times worse than STAR, while the difference is actually quite small :)
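The effect is simple arithmetic, sketched below in Python. The two percentages and the axis start are hypothetical values chosen for illustration, not numbers read off the plot.

```python
def apparent_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the value axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)


# Hypothetical values (percent), NOT taken from the benchmark plot.
star, bwa_mem = 99.0, 97.0

full_axis = apparent_ratio(star, bwa_mem)                  # axis starts at 0
cut_axis = apparent_ratio(star, bwa_mem, axis_start=95.0)  # axis cut at 95

print(f"full axis: bars differ by a factor of {full_axis:.3f}")
print(f"cut axis:  bars differ by a factor of {cut_axis:.3f}")
```

With these assumed numbers, a difference of two percentage points is drawn as one bar being twice as tall as the other once the axis is cut at 95.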
Well, if you don't cut it, no difference at all is visible. Most people here are researchers and they know how to read these plots. The percentage is also written above the bars... I don't see any potential cheating here. :)
Most researchers also use breaks in their bars to address this issue...
For my own curiosity, would one expect much if any variance in performance with these tools? Say if you were to repeat this benchmark but use 100 runs of 100k read pairs randomly sampled from some larger pool of data. It'd be interesting to see ROC plots for these using data from many runs.
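A rough way to get a feeling for the run-to-run spread is to simulate it. The sketch below is in Python and assumes a hypothetical pool of reads in which 97% are mapped correctly; the pool size, the 97% figure and the function names are all made up, and the demo uses fewer and smaller runs than the 100 runs of 100k suggested above so that it finishes quickly.

```python
import math
import random
import statistics


def subsample_accuracies(pool, runs, sample_size, seed=0):
    """Fraction of correctly mapped reads in each random subsample (no replacement)."""
    rng = random.Random(seed)
    return [sum(rng.sample(pool, sample_size)) / sample_size for _ in range(runs)]


def binomial_se(p, n):
    """Binomial approximation of the standard error of an accuracy estimate."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical pool: 1 = read mapped correctly, 0 = not (97% correct is assumed).
POOL_SIZE = 200_000
n_correct = POOL_SIZE * 97 // 100
pool = [1] * n_correct + [0] * (POOL_SIZE - n_correct)

# Small demo; pass runs=100 and sample_size=100_000 with a larger pool
# to mirror the setup described in the comment.
accs = subsample_accuracies(pool, runs=30, sample_size=10_000)
mean_acc = statistics.mean(accs)
sd_acc = statistics.stdev(accs)

print(f"mean={mean_acc:.4f}  sd={sd_acc:.4f}")
print(f"binomial SE at 10k reads:  {binomial_se(0.97, 10_000):.5f}")
print(f"binomial SE at 100k reads: {binomial_se(0.97, 100_000):.5f}")
```

Under these assumptions the binomial standard error for a single run of 100k reads is about 0.0005, i.e. roughly 0.05 percentage points, so whether a gap between two mappers is meaningful depends on how it compares with that scale. This only covers sampling noise from picking reads; it says nothing about randomness inside the mappers themselves.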
p.s. on your website the wall time plots do not have a unit on the ordinate.
Good point @joe.cornish826. The OP used a data presentation style known to be poor practice in order to show barely perceptible differences in data that are reported as a single number taken from a process that has a random component. They have no way of knowing whether the differences are a complete fluke. In summary, they violated two principles of good data presentation: 1) cutting off axes and 2) presenting random data as a deterministic value. The comment that "most people here are researchers who know how to read these plots" is extremely disheartening.
ariel.balter, without cutting off the axes, nobody could see any differences in the bars, which would make the plot completely unreadable. My principles of good data presentation include good visibility of the message I want to communicate. The message in this plot is: "All mapping algorithms are above 95% and the differences are minor." And randomly picking 100k reads for this analysis is statistically absolutely valid for this simple comparison.
We made this plot to give people a feeling for the different mapping algorithms and to show them how they could evaluate mapping algorithms. We do not claim that it is perfect. Actually, we were and still are absolutely open to discussion about how to make this benchmark better.
This post was cited in: https://ieeexplore.ieee.org/abstract/document/8646637
Thanks for your time.
I was just wondering on what basis you classified these as false positive hits.
I really appreciate your response, thanks once again.
A false positive hit means that the mapping algorithm found a position for the read, but RazorS showed that there is a better mapping position in the genome.
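As a minimal sketch of that rule in Python: the `Hit` type, the `classify` function and the label names are hypothetical, and "better" is assumed here to mean a strictly lower edit distance. The actual criteria used in the benchmark may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    chrom: str
    pos: int
    edit_distance: int


def classify(mapper_hit: Optional[Hit], truth_best: Optional[Hit]) -> str:
    """Label a read by comparing the mapper's hit with the ground-truth best hit.

    Assumption: "better" means a strictly lower edit distance.
    """
    if mapper_hit is None:
        return "unmapped"
    if truth_best is not None and truth_best.edit_distance < mapper_hit.edit_distance:
        return "false_positive"  # a better position exists elsewhere
    return "accepted"


# Hypothetical example: the mapper reports a hit with 3 differences,
# while the ground truth knows a position with only 1 difference.
print(classify(Hit("chr1", 1000, 3), Hit("chr2", 5000, 1)))
print(classify(Hit("chr1", 1000, 1), Hit("chr1", 1000, 1)))
print(classify(None, Hit("chr1", 1000, 1)))
```

Note that under this definition the label depends entirely on the ground-truth mapper's scoring, which is why the choice of an edit-distance-based truth set matters for how many hits end up counted as false positives.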