I would like to look through a custom set of FASTA data (say, sequences for gene proximal promoters) for the best match to an input 18mer, returning matches sorted by score. Is BLAT a good tool for this task, or should I use a different approach? If I use BLAT, what are good options to give it to search for 18mers? Apologies in advance if this is a dumb question.
The Blat FAQ has lots of useful information for running Blat. The section Using Blat for short sequences with maximum sensitivity recommends, amongst other things:
- The formula to find the shortest query size that will guarantee a match (if matching tiles are not marked as overused) is: 2 * stepSize + tileSize - 1
- Try using -fine
- Use a large value for repMatch (e.g. -repMatch = 1000000) to reduce the chance of a tile being marked as over-used
So I grabbed a FASTA file of chromosome 1, copy/pasted a chunk of it and made some 18-mers:
>1 GACTTCTTGAATGTATTG >2 TTAGCCATTTAACCACAC >3 TGTGTTGTTTCTCATTCT >4 ACCTGTAGAATCTCAAAG >5 TTCTTTCCCACTTCTATA
And ran Blat a few times with parameters such as:
blat chr1.fa 18mer.fa -stepSize=6 -tileSize=7 -fine -repMatch=1000000 out.psl blat chr1.fa 18mer.fa -stepSize=2 -tileSize=15 -fine -repMatch=1000000 out.psl
No matches were found. However, matches were found by adding:
So - you can get results using Blat, but it is not the most appropriate tool to use.
I would say that BLAT is not the right tool. BLAT is made for searching nearly exact matches and uses a seed that has to be 100% similar (Bowtie also uses the seed), if the two sequences don't have a long enough matching region you won't get any result. The tool for the mission is the basic Smith-Waterman local alignment algorithm implemented in "water" of the EMBOSS package or in the NCBI BLAST website. However, this will only give you the best hit. Another tool that might be used is MAST from the MEME suite. You will have to generate a relaxed PSSM that represents the short sequence, allowing for mismatches. You can do it manually or manipulate MEME to generate such a motif by giving the sequence several times as input along with poly A-C-G-T sequences in the same length, limit the size of the motif to the size of the input sequence and choose oops mode - exactly one match per sequence. MAST will then search for the relaxed sequence motif in the long sequence.
For short RNAs I would recommend segemehl.
It has two pros for small RNAs:
- It allows not only mismatches, but also indels in the seed (by default)
- It finds all possible equally good hits within the reference sequence (by default a paramter is set to an upper bound of 100 hits, before the read will be omitted)
Point 1 is important, since small ncRNAs tend to occur in several copies (e.g. hsa-mir-16-1 and hsa-mir-16-2 have the same sequence, but are encoded on two different chromosomes). If you do not allow multiple mappings and discard these reads, you will end up with misleading results.
And bowtie actually has some problems with short RNAseq data and multiple hits. I would recommend to read Attention: Bowtie2 And Multiple Hits.
So, if your sequences can occur several times, or tend to have mistakes, I would recommend to set the correct parameters in bowtie (as explained in the mentioned thread), or use another tool.
BLAT tends to "oversplice" alignments, such that if there is a way to arrive at an otherwise perfect alignment with a 200bp gap between canonical splice sites, it will prefer that over something with an outright internal mismatch.
You could use Vmatch for this. Vmatch employs suffix trees, and is non-heuristic, meaning it is guaranteed to find all hits of a certain threshold. It also has more edit and hamming distance parameters than typical NGS aligners.
I think bowtie (version 1) might work well for this. I know that some of my colleagues are using bowtie to find micro RNAs that are just a few bases longer than your targets. Limitations are that is really only suited to find exact matches or up to a few mismatches.