I would like to look through a custom set of FASTA data (say, sequences for gene proximal promoters) for the best match to an input 18mer, returning matches sorted by score. Is BLAT a good tool for this task, or should I use a different approach? If I use BLAT, what are good options to give it to search for 18mers? Apologies in advance if this is a dumb question.
I would say that BLAT is not the right tool. BLAT is made for searching nearly exact matches and uses a seed that has to be 100% similar (Bowtie also uses the seed), if the two sequences don't have a long enough matching region you won't get any result. The tool for the mission is the basic Smith-Waterman local alignment algorithm implemented in "water" of the EMBOSS package or in the NCBI BLAST website. However, this will only give you the best hit. Another tool that might be used is MAST from the MEME suite. You will have to generate a relaxed PSSM that represents the short sequence, allowing for mismatches. You can do it manually or manipulate MEME to generate such a motif by giving the sequence several times as input along with poly A-C-G-T sequences in the same length, limit the size of the motif to the size of the input sequence and choose oops mode - exactly one match per sequence. MAST will then search for the relaxed sequence motif in the long sequence.
Thanks for the pointers — I actually do want to look for mismatches, as well. I'll probably set up a local BLAST+ environment, putting mismatches and indels into output categories, and see how that goes.
BLAST uses seed matching as well to save time, the BLAST server has an implementation of Needelman-Wunch algorithm for global alignment that will give pretty much the same results. If you do use BLAST set the seed as small as possible
Is there a way to enforce a minimum number of mismatches in BLAST, instead of searching for exact matches? Also, blastn-short is not working too well with shorter sequences, even with adjustments to the E-value and word length (seed) - is BLAST not ideal for shorter sequences?
You should really use a short-read mapper. Such mappers are designed specifically for your type of data. Blast and blat are not the right tools for this task.
It allows not only mismatches, but also indels in the seed (by
default)
It finds all possible equally good hits within the reference
sequence (by default a paramter is set to an upper bound of 100 hits, before the read will be omitted)
Point 1 is important, since small ncRNAs tend to occur in several copies (e.g. hsa-mir-16-1 and hsa-mir-16-2 have the same sequence, but are encoded on two different chromosomes). If you do not allow multiple mappings and discard these reads, you will end up with misleading results.
So, if your sequences can occur several times, or tend to have mistakes, I would recommend to set the correct parameters in bowtie (as explained in the mentioned thread), or use another tool.
Firstly, your old thread is about bowtie2, not bowtie. Bowtie2 is probably not the right tool anyway for 18bp reads (it is designed for 454 and longer Illumina reads) and we do not know how bowtie behaves, yet. Secondly, I guess Alex wants to see all hits. In that case, it does not matter whether bowtie has a problem with reporting a random hit. Thirdly, if Alex is working with a mammalian genome, most hits with 2+ mismatches/indels are likely to be random hits. We do not need an ultrasensitive mapper.
You're absolutely right. But that's what I actually said in my last sentence, no? ;) I'm not sure, what exactly Alex is up to. If he wants to see, how stable (unique) an 18mer is, I would say, that allowing 2+ errors is a good way to check that. Most mapping algorithms don't allow indels in their seeds, which is a big problem, when your sequence length is close to the seed length. You will only get perfect hits, or miss the mapping completely.
After you strongly recommended segemehl several times, I decided to give it a try. I think it is not the right tool for this task. I am not sure how to configure it the best way, but at least with the default setting, segemehl only reports the optimal hits, misses some optimal ones and is memory hungry and inconvenient to use. For very short reads, bwa and gem are better by far. Segemehl may be good for >100bp reads, but even so, the inconvenience may push away a lot of potential users.
lh3, you said, that segemehl misses some optimal hits. Since I use this tool for most of my analyses, I want to isolate the problem in order to see where the problem is. I spend some time searching for these hits (by comparing the mappings with these from other tools), but I did not find such an example. Can you send me one or two of these reads, so that I can check them? Thanks a lot!
Ah ok, I haven't made the connection with the other post. I think you're talking about the second read in your list. segemehl calls only 30 with 2 mismatches and bowtie 50. I'll check where the problem is. But you should really write down the used parameters... otherwise it's a bit like comparing apples with oranges, no? The tools you know good get better results, because you know how to set the parameters correctly. Anyway... I think it is a great idea to open a thread comparing mapping algorithms... especially with short reads! :) And now I stop the discussion here and switch to your post... :)
BLAT tends to "oversplice" alignments, such that if there is a way to arrive at an otherwise perfect alignment with a 200bp gap between canonical splice sites, it will prefer that over something with an outright internal mismatch.
You could use Vmatch for this. Vmatch employs suffix trees, and is non-heuristic, meaning it is guaranteed to find all hits of a certain threshold. It also has more edit and hamming distance parameters than typical NGS aligners.
For 18-35bp sequences, many short-read aligners are non-heuristic. Eland, RMAP, MOM and SOAP2 among many others guarantee to find up to 2-mismatch hits. BWA guarantees to find hits within certain edit-distance. For longer sequences with more allowed distance, these non-heuristic algorithms are inefficient, but even in that case, there is GEM which guarantees to find all hits within certain hamming distance at least for ~100bp reads. As to vmatch, I am actually not sure if it is non-heuristic for non-exact matches. What worries me is that it has this blast-like X drop off parameter. Its manual and website do not state it guarantees to find all non-exact hits, either. Nonetheless, it is possible that X drop off is not used for short reads or I may have missed something important in the documentations. In that case, I apologize and will correct myself later.
I think bowtie (version 1) might work well for this. I know that some of my colleagues are using bowtie to find micro RNAs that are just a few bases longer than your targets. Limitations are that is really only suited to find exact matches or up to a few mismatches.
you get the CIGAR string and MD tags in BAM. Alas the mapping quality that it produces is probably not that useable (bowtie2 might also work just that I never tried). Perhaps someone else knows of a better approach.
Thanks for the pointers — I actually do want to look for mismatches, as well. I'll probably set up a local BLAST+ environment, putting mismatches and indels into output categories, and see how that goes.
BLAST uses seed matching as well to save time, the BLAST server has an implementation of Needelman-Wunch algorithm for global alignment that will give pretty much the same results. If you do use BLAST set the seed as small as possible
Is there a way to enforce a minimum number of mismatches in BLAST, instead of searching for exact matches? Also,
blastn-short
is not working too well with shorter sequences, even with adjustments to the E-value and word length (seed) - is BLAST not ideal for shorter sequences?You should really use a short-read mapper. Such mappers are designed specifically for your type of data. Blast and blat are not the right tools for this task.