Hi I have an amplicons library with a few hundred targets that are similar, I expect no gaps, and for simplicity I expect no mismatches (so I penalize).
I use bowtie2 2.4.1 with the following command:
bowtie2 -a -D 100 -R 5 --local --no-unal --gbar 800 --ma 2 --mp 50,50 --np 50 --rdg 50,30 --rfg 50,50 --threads 1 -x ../library/index/libidx -U /scratch/file.fastq -S /scratch/out.sam
I look at alignments with NM:i:0, the problem is that '-a' doesn't return all possible alignments (I tried with and without '-D', '-R' options). When I look manually (simple search w/o MM) for some aligned sequence that it finds, I see there are many more alignments that it missed, sometimes there is only 1 alignment listed with higher score and many alignments with lower score. I know that bowtie2 wasn't designed for these situations in terms of performance, but so far it was fast enough for me.
Is it a well known problem? is there an undocumented limitation?
in the example SAM output below (single read, not all alignments, NM:i:0 alignments, no header, sequences+QS columns removed), all reads should be mapped to WT_Reference (only the WT was sequenced in this case), but there are many other alignments Variant_20, with CIGAR=388M197S, is the best alignment with NM:i:0, it is found 21 times (including WT_Reference) but is listed only once. The library size is 205 sequences (WT_Reference + Variant_1...Variant_204)
thanks, Shim
@PG ID:bowtie2 PN:bowtie2 VN:2.4.1 "CL:""/Tools/Bowtie/bowtie2-2.4.1-linux-x86_64/bowtie2-align-s --wrapper basic-0 -a -D 100 -R 5 --local --gbar 800 --ma 2 --mp 50,50 --np 50 --rdg 50,30 --rfg 50,50 --threads 1 -x ../library/index/libidx --passthrough -U /scratch/file.fastq"""
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_20: 602 255 388M197S * 0 0 AS:i:776 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:388 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_25: 602 255 161M424S * 0 0 AS:i:322 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:161 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_26: 602 255 160M425S * 0 0 AS:i:320 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:160 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_21: 602 255 137M448S * 0 0 AS:i:274 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:137 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_77: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_76: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_72: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_74: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_78: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_75: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_73: 602 255 79M506S * 0 0 AS:i:158 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:79 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_172: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_175: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_136: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_170: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_171: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_135: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_139: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_140: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_173: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_176: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_137: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_174: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_138: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_141: 602 255 68M517S * 0 0 AS:i:136 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:68 YT:Z:UU
M01015:88:000000000-J5K85:1:1101:12543:1681 256 Variant_185: 602 255 67M518S * 0 0 AS:i:134 XS:i:910 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:67 YT:Z:UU
...
Short read aligners were never designed to be mathematically optimal or to produce all equally scoring alignments. Thus I would not recommend using that for this type of use cases.
Thank you Istvan, I'm aware short read aligners don't produce all possible alignments, but I do expect a high coverage of the best alignments (e.g. > 95%; and missing 20/21 of a particular score' alignments is a problem). Often there are "Effort options" (like bowtie2's '-D', '-R') that should produce more results probably at a significant performance cost, but, as I mentioned, that didn't help.
Do you know a tool that better suits my use case? i.e. mapping many fastq short reads to < 300 variants that are similar, it would be best if it output all the best-rank-alignments (similar to bowtie1's -a --best --strata).
PS my analysis is somewhat similar to microbiome analysis, but so far I found only mothur tool (Schloss, et al., 2009) which I'm not sure suits my pipeline..
best, shim
Try
bbmap.sh
as an alternate. It has aambig=all
option that should report all possible alignments for a read. There is a guide available. There are many options available on command line. You will need to spend some time experimenting.Thank you very much genomax, I will try bbmap (I saw it in microbiome pipeline somewhere but never used it)