Tutorial: For Short Reads, Which Aligners Find All Hits Given Certain Edit Distance Threshold?
27
gravatar for lh3
7.8 years ago by
lh332k
United States
lh332k wrote:

Here is the number alignments of five 31bp reads reported by 5 aligners:

Sequence                          bwa -R10000/GEM -s1  MrFast -e2      bowtie -l31      bowtie2
--------------------------------------------------------------------------------------------------------
AGGGCCAGCTAAATTTTGTATTTTTAGAAGA   27 2-mismatch        21 2m           27 2m            1 2m; 1 3m
GTCAGTATGGCCAATTTCACGATAATGATTC   50 2-mismatch        30 2m           50 2m            unmapped
TTTGATTCTTCTGTCTTTTCTTCTTTAATAG   103 1m; 4119 2m      60 1m; 2053 2m  103 1m; 4098 2m  103 1m; 24 2m
GCCCCCTGAGCAGCTGGGATTACAGGCGAGC   10 1m; 121 2m        10 1m; 115 2m   10 1m; 73 2m     10 1m; 7 2m; 17 3m
ATCTTTCTCTTCTTATTTTTGTATGAATATA   10 1-del             10 1d           unmapped         10 1d

Only BWA and GEM report all the hits up to 2 mismatches/indels. Bowtie1 finds the best hits, but not gapped hits or suboptimal hits. Bowtie2 and MrFast may miss 2-mismatch hits. On chr1, segemehl does not output suboptimal hits. It finds all the 1-difference hits but misses two 2-mismatch hits. I have not tried segemehl on whole-genome as it is inconvenient to run.

I have also tried bwa, bowtie and gem at a larger scale. For ~4200 31bp reads having 2 mismatches, bwa and gem report exactly the same number of hits for every read. Bowtie on the other hand misses hits for 25% of reads - bowtie does not guarantee to find all hits. <del>If I use bwa and gem to find 2-difference hits, bwa finds more, suggesting that gem does not guarantee to find all gapped alignments.</del> Note that I am not saying that bwa can find all 2-difference hits. In theory it should, but in practice, optimization to the original algorithm may lead to bugs in corner cases, which I have never checked.

In summary, among the mappers tested here, bowtie, bowtie2, mrfast and segemehl do not guarantee to find all the 2-mismatch hits for ~30bp reads. Only bwa and gem achieve full-sensitivity to mismatches. <del>Bwa is more sensitive than gem to gapped alignments.</del> Nonetheless, at the full sensitivity, gem works with 100bp reads, but bwa has to use heuristics. Gem is a better mapper than bwa (IMHO gem is the first mapper better than the origianl bwa all around), as long as you use it the right way.

EDIT: for gapped alignment, bwa may produce duplicated hits (a bug I thought I have fixed...). That said, gem does not guarantee to find all gapped hits. I have found one example with the best match contains 1 gap, while gem only reports 2-mismatch hits.

EDIT2: Conclusions (so far):

With the suggestions from nkrumm and David, we can say that bwa, gem, mrfast, segemehl and RazerS3 probably guarantee to find hits within certain hamming distance threshold. These mappers all support gapped alignment, but whether they guarantee to find all gapped hits given an edit-distance threshold has not been concluded. Neither bowtie nor bowtie2 guarantees to find all hits.

For practical uses, bwa's try-and-reject strategy is probably faster for <32bp sequences. MrFast, gem and razers3 all use the pigeon hole approach. They should be faster for longer reads (e.g. 36-100bp). BWA and gem index the genome. They have fixed memory footprint and is easier to run (you do not need to split the read file when it is too large), and also run much faster if you only have a few to a couple of thousands of reads at hand. I would recommend them in general.

All that being said, for 100bp reads, finding all hits given a certain edit-distance threshold is not sufficient. We would like to see if alignments containing one reasonably long gap (say 7bp insertion) and to rank the hits by affine gap penalty instead of by edit distance. In computer science, edit distance is popular as it is clearly defined and many efficient algorithm exist to find matches. However, affine gap penalty is of more biological/evolutionary meaning.

ngs alignment tutorial bwa • 13k views
ADD COMMENTlink modified 7.7 years ago by pan.alexiou10 • written 7.8 years ago by lh332k
2

This article compares 9 mappers and describes their algorithms. It doesn't deal with indels and paired reads (for the moment) but it's worth reading. Their pipeline is available here.

ADD REPLYlink written 7.8 years ago by tflutre510

This is a great read. Thanks for posting it.

ADD REPLYlink written 7.8 years ago by Alex Reynolds31k
1

unrelated, but are you going to add GEM to the mapper ROC page?

ADD REPLYlink written 7.8 years ago by brentp23k
3

It is not that simple. Firstly, I need to simulate more reads. I have seen fluctuations in case of gem. Secondly, I need to write a proper gem-to-sam converter for paired-end reads. I also need to figure out how to squeeze the best out of gem given PE reads. Thirdly, although gem uses a better algorithm, there are quite a few things to do to make it a drop-in replacement of bwa/bowtie/etc. I am a little concerned with including a mapper which has potentials but is not usable as a typical mapper yet (EDIT: by typical I mean in the context of variant calling). All that said, I should emphasize that the gem algorithm is better for >100bp reads. In principle, I can take its single-end output and generate a more accurate bwa-flavored SAM with halved total CPU time in comparison bwa. PS: most short-read mappers do not collect enough information to be converted to bwa-flavored alignments - they ignore suboptimal hits for speed. Gem is one of the few that make this possible.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k
2

Just out of curiosity, what keeps GEM from being a "typical mapper" in the contest of variant calling? Is it really just lack of a good gem-2-sam converter, or is there something inherent in the algorithm?

ADD REPLYlink written 7.8 years ago by Sean Davis26k
2

Mostly about the implementation. From its paper, the developers might be thinking mapping quality is not of much use, so gem does not compute mapping quality. For single-end, I can estimate good mapq using the bwa method, but it is hard to do the same for PE reads as it outputs less information. Also at my hand, its PE mapping is less accurate than SE, but it should not be so given that gem is clearly better than bwa for SE (that is why I said I have not got the best from gem for PE). In addition, gem does not perform affine-gap alignment and does not left-align gaps. You need a separate step to refine CIGAR. On its algorithm, a potential concern is that it requires full-length match in its current implementation. When reads get longer and longer, local alignments like bowtie2 or bwa-sw is preferred. Another caveat is the hard edit-distance threshold on high-quality bases (it allows many low-quality mismatches). By default, it is 4%. If there is an indel longer than 4bp or there are 5bp mismatches on a 100bp read, gem will not map it. Such reads only constitute a tiny fraction of all reads - the mapping sensitivity is barely affected, but you will consistently lose alignments in highly divergent regions (Gerton first made me think this is important). To get these divergent reads mapped, you need to increase -e/-m if you do not want to do a multi-phase alignment, but in that case, gem will run slower. I could be wrong on the last point. It would be more appropriate to evaluate human variant calls, but the lack of correct left-aligned SAM makes such an experiment hard, for now.

ADD REPLYlink written 7.8 years ago by lh332k

Hi, recently, i have a problem about GEM. When i want build index for human using GEM, it reported Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > human_g1k_v37.fasta.log && gem-indexer_fasta2meta+cont -i human_g1k_v37.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --comp.

The command  i used is ./gem-indexer -i /devdata/chhy/masai/bin/human_g1k_v37.fasta -o hum.index -T 4.

Could you help me to solve this problem?

Thanks!

ADD REPLYlink written 5.8 years ago by chhylp1230
3

New ROC are here (PDF).

ADD REPLYlink written 7.7 years ago by lh332k

thanks! BWA still looking good.

ADD REPLYlink written 7.7 years ago by brentp23k

That is the new algorithm released today. The original BWA falls short for single-end. I am still worrying that I have not used gem properly for paired-end.

ADD REPLYlink written 7.7 years ago by lh332k

Sorry for being dumb, but where is the mapper ROC page?

ADD REPLYlink written 7.8 years ago by Dan520

http://lh3lh3.users.sourceforge.net/alnROC.shtml

ADD REPLYlink written 7.8 years ago by brentp23k

Thanks brentp!!

ADD REPLYlink written 7.7 years ago by Dan520
1

Hi lh3, can you add the parameters used for each tool and a description of your inputs (and how you generated them)? This would help a great deal with reproducing your tests.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by Alex Reynolds31k

One comment to segemehl: was the -H option set to 0? (-H, --hitstrategy report only best scoring hits (=1) or all (=0) (default:1))

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k

What about adding STAR?

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k

Well, there are just so many mappers. I am sure if you go through the 40-ish papers, you will find more mappers that are claimed to find all k-mismatch/k-difference hits. It is hard to cover them all. My initial goal is to focus on the most popular mappers. Gem is actually not very popular right now, but I think it has a great potential.

ADD REPLYlink written 7.8 years ago by lh332k

You mention that segemehl has "several" practical concerns with most of its functionality covered by other mappers. Can you please go into some more detail?

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k

EDIT: in short, segemehl: 1) requires too much memory; 2) is very inconvenient to run if we separate the genome for a smaller memory footprint; 3) does not output the best alignment with a proper mapping quality, which is required by many variant callers.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k

The huge memory footprint of segemehl is actually the main drawback (that's not new). But most collaborators I work with have machines with more than 50 GB of main memory. Actually, I think that main memory is no longer a problem nowadays. At least for big institutes. Nevertheless, you are completely right. The needed memory is huge and segemehl is not a good choice to run on your personal computer. When reading your practical concerns (splitting the genome to chromosomes), I'm absolutely with you. Don't use segemehl like that... you WILL go nuts.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by David Langenberger9.5k

Will 64GB be enough? I read from the segemehl manual that you need 6GB for chromosome 1. That is 1/10 of the human genome. In that trend you would need 60GB for segemehl. With disk cache and a couple of other GB jobs on the machine, 64GB might sound pushing the limit.

ADD REPLYlink written 7.8 years ago by lh332k

But despite the memory problem, why would you not recommend segemehl for Illumina reads? That is actually a strong assertion, so please explain!

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k
1

It does not produce SAM containing the best alignments with proper mapping quality. This is the typical input for most variant callers. For a mapper using heuristics, it is non-trivial to derive good mapping quality. In addition, requiring 50GB memory alone is a problem big enough for me to discourage its uses. Not every lab can access those huge memory machines. At the same time, the biggest few institutes use huge clusters, with each computing node accessible to limited shared memory. I would not recommend something that use unnecessary memory for results achievable by many other mappers.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k

Did you test Mosaik?

ADD REPLYlink written 7.8 years ago by Erik Garrison2.3k

This was a while ago, but if people are still interested in this question, try vmatch. It's a general-purpose aligner that can handle this situation for both hamming and edit distance.

ADD REPLYlink written 3.5 years ago by duncanb0
5
gravatar for nkrumm
7.8 years ago by
nkrumm50
nkrumm50 wrote:

One important caveat: MrFAST uses 12bp seeds and a pigeon-hole principle to ensure that alignments with 2 mismatches are guaranteed to be found in 36mer reads or longer (need at least one matching 12mer seed to align). In this case, using a 31bp read and attempting to find 2 mismatches would require a smaller seed to be used (10bp, for example).

ADD COMMENTlink written 7.8 years ago by nkrumm50

Good catch. You are right. I have not really understood MrFAST's algorithm.

ADD REPLYlink written 7.8 years ago by lh332k
5
gravatar for lh3
7.8 years ago by
lh332k
United States
lh332k wrote:

The following are the command lines I was using, found from history. You cannot reproduce exactly the same results, as wgsim uses a random seed by default, which I have not written down. Also note that bwa may miss gapped alignment produced by gem. MrFast seed length is set to 12bp as the default.

wgsim -1 31 -2 31 hs37m.fa r1.fq /dev/null > /dev/null   # this uses a random seed
bwa index hs37m.fa
bwa aln -O3 -E3 -R1000000 -i1 hs37m.fa r1.fq > r1.sai    # gapped alignment; a gap costs the same as a mismatch, which is not the default
bwa samse hs37m.fa r1.sai r1.fq > r1.sam                 # text alignment
bwa aln -o0 -R1000000 hs37m.fa r1.fq > r1.o0.sai         # ungapped alignment
bwa samse hs37m.fa r1.sai r1.fq > r1.o0.sam
perl -ne 'print if /X0:i:(\d+)/&&$1>=10&&/NM:i:2/' r1.o0.sam | awk '{print ">"$1"\n"$10}' > ungapped.fa
perl -ne 'print if /X0:i:(\d+)/&&$1>=10&&/NM:i:2/' r1.sam | awk '{print ">"$1"\n"$10}' > gapped.fa

# for every mapper, the reference index is built with the default settings
gem-mapper -I hs37m.gem -e 0.04 -m 0.08 -s1 -i seq.fa > aln.map
mrfast --search hs37m.fa -e 2 --seq seq.fa -o aln.sam
segemehl.x -i chr1.idx -d chr1.fa -q seq.fa > aln.sam
bowtie -a -f -S -l 31 --sam-nohead hs37m-bt1 seq.fa > aln.sam
bowtie2 -x hs37m-bt2 -a -f seq.fa > aln.sam
ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by lh332k

Would it be possible to add some information how the mappings were evaluated? That way we will be able to test other tools and compare them against top 2.

ADD REPLYlink written 7.8 years ago by Darked894.2k

You just count the number of hits.

ADD REPLYlink written 7.8 years ago by lh332k
3
gravatar for David Langenberger
7.8 years ago by
Deutschland
David Langenberger9.5k wrote:

To find a positive set of hits, we used RazerS 3 (full sensitivity) to find all hits (due to performance reasons, we searched only for up to 3 errors):

razers3 -of sam -rr 100 -i 90 -m 100000000 -o test.razers3 hg19.fa test.fa

Sequence                          RazerS 3
--------------------------------------------------------------------------------------------------------
AGGGCCAGCTAAATTTTGTATTTTTAGAAGA   29 2e; 1764 3e
GTCAGTATGGCCAATTTCACGATAATGATTC   51 2e; 6105 3e
TTTGATTCTTCTGTCTTTTCTTCTTTAATAG   103 1e; 4183 2e; 5842 3e
GCCCCCTGAGCAGCTGGGATTACAGGCGAGC   10 1e; 138 2e; 3287 3e
ATCTTTCTCTTCTTATTTTTGTATGAATATA   10 1e

Furthermore, we called segemehl (0.1.4) on the complete genome (hg19), using the following parameters:

segemehl.x -d hg19.fa -i hg19.idx -q test.fa -o test.segemehl -D 2 -H 0

-D 2: allow up to 2 errors in the seed (default:1); due to the small read length (close to the seed length), it should be set

-H 0: find also suboptimal hits (default:1; best only)

Sequence                          segemehl
--------------------------------------------------------------------------------------------------------
AGGGCCAGCTAAATTTTGTATTTTTAGAAGA   28 2e; 12 3e; 14 4e
GTCAGTATGGCCAATTTCACGATAATGATTC   51 2e; 3 3e; 1 4e
TTTGATTCTTCTGTCTTTTCTTCTTTAATAG   103 1e; 12 2e; 1 3e
GCCCCCTGAGCAGCTGGGATTACAGGCGAGC   10 1e; 24 2e, 24 3e, 37 4e
ATCTTTCTCTTCTTATTTTTGTATGAATATA   10 1e; 1 4e

EDIT: (1) The RazerS 3 counts are corrected (duplicated hits are counted once). (2) The segemehl mappings are now for the 1000g phase2 genome (excluding the last sequence hs37d5) making the results comparable.

ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by David Langenberger9.5k

There are two caveats. Firstly you are including ALT contigs while I am not. One hit of the first sequence is from chr6 haplotype. Secondly, I said the first sequence has 27 2-mismatch hits. These exclude gapped alignments, while you are counting gapped alignments. If I run bwa/gem in the gapped mode, they both find 27 2-mismatch and one 1-mismatch-1-indel hit. Bwa/gem both report 51 2-difference hits for the second sequence. They are not missing any true hits.

RazerS is buggy. It is reporting duplicated hits.

ADD REPLYlink written 7.8 years ago by lh332k

Where can I access the genome you used? It's not a problem to run segemehl on your reference genome. Then I can again send you the sam file, so that you can do the counting. It's a good idea of one person does that to assure that it's counted in the same fashion for all tools. And you are right, I did not distinguish between mismatches and indels (just changed the "m" to an "e" for "error"). But I actually never mentioned that any tool misses true hits... :) I will clean the razers3 results and update the table.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by David Langenberger9.5k

I am using the 1000g phase2 genome, excluding the last sequence hs37d5. With your current table giving more hits, it would appear that both bwa and gem are missing a couple of hits. I was just explaining that that was caused by the use of different genomes and different ways of counting mismatches/gaps.

ADD REPLYlink written 7.8 years ago by lh332k
1
gravatar for daiefa123
7.8 years ago by
daiefa123150
Germany
daiefa123150 wrote:

Great work! Can you provide the complete calles you used for the 5 tools? I am planing something similar since several weeks now. Your work could be a great starting point! I will add more tools!

ADD COMMENTlink written 7.8 years ago by daiefa123150
1

I just simulated reads, mapped them with bwa and then chose the reads I need. Note that bwa is not configured to find hits given a edit distance threshold. You need "-E3 -O3 -R100000 -i1". Gem does not find suboptimal and gapped hits by default. For short reads, you should use "-s1 -e 0.04 -m 0.08". For bowtie1, you need to increase the seed length to the read length to effectively disable seeding.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k
2

That is exactly what I want. Can you please write down the complete used parameter sets for all 5 tools? bowtie, bowtie2, mrfast, bwa and segemehl. I just need these to reproduce your results. Thank you!

ADD REPLYlink written 7.8 years ago by daiefa123150

So, I am out of this discussion. lh3 seems not to be able to say what parameters were used for bowtie, bowtie2, mrfast and segemehl. The way he writes is too much agressive for me. That is not the way I want to do research. See you guys in other posts!

ADD REPLYlink written 7.8 years ago by daiefa123150

Haven't you seen my last post about all the command lines?! Please read first before making a strong assertion.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k
1

Sorry, I did not see it. But still, I do not like your agressive style. Who are you? The mapping god? You judge other tools in a very unkind way. Yeah, bwa rocks. I think we all see that now (with hundreds of special parameters at least). I do not like these fights. Have fun.

ADD REPLYlink written 7.8 years ago by daiefa123150
9

"Who are you? The mapping god?" Yes @daiefa123, you might say that...

ADD REPLYlink written 7.8 years ago by Daniel Standage4.0k

If some sentences look aggressive to you, I am sorry. I admit I am quite aggressive towards segemehl in my new comment to David which you had not seen yet when you posted your comment, but I did not mean to be aggressive to mrfast, gem and razers, which I think are all good. I openly said that they will do a better job than bwa for a bit longer reads and applauded gem for its more advanced algorithm. I wonder which sentences make you feel I am aggressive to mrfast, gem or razers? I am willing to change those as I did not really mean to. As to the bwa parameter uses, the default is chosen for typical variant calling, but not for finding all the suboptimal hits within certain edit distance. Those outputting all hits by default cannot work with most widely used variant callers. In addition, all the other mappers require to change the default settings for this data set and/or for gapped alignment. Bwa is not alone.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k
1

You shouldn't be agressive at all! Not even against segemehl! ;) I think this is a nice place for discussions here and we should use it for that. No fighting! I hope your comments are not meant personal. I also think your comments are a bit....well...harsh. But I think it's a discussion and everyone has to have the right to love his tool!

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k

There are 40-ish mappers. I feel it is pointless to go through each of them without making recommendations: without recommendations, many endusers will still be in dark about what mappers they should use for a particular task. When I make recommendations, there will be good and bad. Nonetheless, I guess what I should not do is to say "I would not recommend xxx", which will hurt someone. Sorry for that and I have deleted that sentence.

ADD REPLYlink modified 7.8 years ago • written 7.8 years ago by lh332k

Don't worry... I actually like the discussion.

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k

right, he actually wrote them.

ADD REPLYlink written 7.8 years ago by David Langenberger9.5k
1

ummm... I was just thinking about it (it might be just a typo, my apologizes if this remark is out of context). Using GEM with "-e 0.04 -m 0.08" seems to me not to be the most appropriate. You're requesting GEM to be exhaustive up to 8% mismatches, but then you are restricting the valid alignments up to 4% (making GEM invest more time to finally drop many of them, >4%). So, as a rule e>=m and I would try with something more like "-m 0.04 -e 0.08". These are very short reads, but speaking of which, if time is really a concern (and times using these cmd lines are not acceptable) it might be a good idea to explore "--fast-mapping=0" and "--fast-mapping=adaptive" modes. Their effect might probably be more remarkable as the read length increases (75nt,100nt,150nt,...), but I guess it's an approach its worthwhile exploring.

ADD REPLYlink written 7.6 years ago by santiagomsola10
1
gravatar for pan.alexiou
7.7 years ago by
pan.alexiou10
pan.alexiou10 wrote:

Interesting, however when using bwa to find all suboptimal alignments by changing the -n option, I get more alignments for -n 5 than for -n 7 which seems wrong to me (for the same read). Any comment on that?

I've explained more Finding all sub-optimal alignments with bwa

ADD COMMENTlink modified 7.7 years ago by Istvan Albert ♦♦ 85k • written 7.7 years ago by pan.alexiou10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2127 users visited in the last hour