Question: Ultrafast alignment of short sequences to a genome (with up to 5 mismatches!)
1
gravatar for Aurelie MLB
4.9 years ago by
Aurelie MLB320
United Kingdom
Aurelie MLB320 wrote:

Hello,

I would like to be able to align around 500 short sequences (~20 nucleotides) to the human genome in a super quick way.

I would like to authorize up to 5 mismatches but no gaps. I also would like to retrieve all possible hits for one sequence... not just the best ones. Then, I would like to be able to annotate the hits.

I will need to repeat this work very often.

I investigated the NGS tools like bowtie or BWA to see if I could tweak them but it seems that the number of mismatches are limited to 3 and 1 respectively...this is a pity because it would have been great to have the results in a SAM format to do the follow up annotation step!

Could anyone advise me one a very fast aligner to do what I would like please?

Many thanks!

 

 

alignment genome • 2.7k views
ADD COMMENTlink modified 4.7 years ago by Biostar ♦♦ 20 • written 4.9 years ago by Aurelie MLB320

did you play with the options of bwa ?

       -B INT        penalty for a mismatch [4]
(...)
ADD REPLYlink written 4.9 years ago by Pierre Lindenbaum121k

I got curious about this... Actually it seems that persuading bwa mem to align very short reads it's not easy... Here's a little test I performed on hg19 genome. One read is 20bp with no mismatches, the other has just 1 mismatch. I can't get bwa mem to align the second read!

mem genome.fa oligos.fa -T 0 -B 1
[M::main_mem] read 2 sequences (40 bp)...
[M::mem_process_seqs] Processed 2 reads in 0.000 CPU sec, 0.003 real sec
no_mism    0    chrM    1    6    20M    *    0    0    GATCACAGGTCTATCACCCT    *    NM:i:0    MD:Z:20    AS:i:20    XS:i:0
one    4    *    0    0    *    *    0    0    GCTCACAGGTCTATCACCCT    *    AS:i:0    XS:i:0

(the second read has 1 mismatch on the second base)

I set -B and -T to be very permissive but I had no luck in aligning both reads! (I played a bit also with -w, no better)

 

 

ADD REPLYlink written 4.9 years ago by dariober10k

Thanks for trying Dariober!

I do not know well BWA so I do not get the out put very well... but could it be that he only returns the best alignment?

ADD REPLYlink written 4.9 years ago by Aurelie MLB320

Yes, by default bwa returns the best alignment and in case of ties returns one of the best hits at random. The point here is that the second read is reported as unmapped while I expected to be mapped at the same location as the first one.

ADD REPLYlink written 4.9 years ago by dariober10k

bwa is not a generic aligner, it was not designed to work for very short reads, bowtie does however work to some extent for miRNA sized reads (18-24bp)

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 80k

Indeed, I just thought with some massaging of the parameters it could be done... But as I said below, an NGS aligner might not be the best choice...

ADD REPLYlink written 4.9 years ago by dariober10k

Hello

Thanks for answering!

Would you have an idea of how to play with this parameter please? I do not see how I could control the number of mismatches from this...

Thanks a lot

ADD REPLYlink written 4.9 years ago by Aurelie MLB320

Maybe NGS tools are not the way to go...

They just sound interesting in the first place. But apparently they are not designed to provide all possible alignement hits  for a given sequence... (?)

 

ADD REPLYlink written 4.9 years ago by Aurelie MLB320

I was going to suggest that NGS aligners are not the best choice... As a side note, keep in mind that allowing 5 mismatches in 20bp and asking for all the hits is probably going to generate millions of records in output.

One option I would try is to use a general aligment tool like vmatch.

ADD REPLYlink written 4.9 years ago by dariober10k

The problem is that the more mismatches you allow in a short read the more diverse the read becomes and more locations it can align to. That is not what most fast mappers were designed to do, these aligners want to find the single best location. You can't have both fast alignments and very flexible ones. It is very unlikely that you could tweak these aligners to do what you want.

maybe you could generate all possible mismatches with a program then look for exact matches with the aligner.
 

ADD REPLYlink written 4.9 years ago by Istvan Albert ♦♦ 80k

Thanks Albert! That is an idea. I would have to think about it though. It sounds that this would have a high complexity and I might gain time for the alignment but spend a lot of time generating all the possibilities. I would not know how to do this efficiently..

On a different note, I was looking at MegaBlast... would someone know this one. Could it help?

ADD REPLYlink written 4.9 years ago by Aurelie MLB320

It says: 

Megablast: This program is optimized for aligning sequences that differ slightly

You are looking for 5 mismatches in 20 reads, that is 25% different, in sequencing that is actually a lot! plus it would be a local alignment, likely not what you are looking for.
 

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Istvan Albert ♦♦ 80k
Ok. Thanks for your answer! I will look for something else then. I am so surprised that nothing seems to exist to do that. It does not feel like something very special to do...
ADD REPLYlink written 4.9 years ago by Aurelie MLB320

As the manual says, bwa-mem is not intended for very short reads. With bwa-aln, you need something like "bwa aln -n5 -k5 -o1". It will be very slow, but for 500 sequences, it should be find. Beware that you are likely to get a huge pile of random hits with 5 mismatches in a 20bp sequence.

ADD REPLYlink written 4.7 years ago by lh331k

Check out the Bitap algorithm, it looks like this is what you need

ADD REPLYlink written 4.7 years ago by mikhail.shugay3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 943 users visited in the last hour