Is Blat A Good Choice To Find 18Mers In A Custom Fasta File
5
2
Entering edit mode
9.5 years ago

I would like to look through a custom set of FASTA data (say, sequences for gene proximal promoters) for the best match to an input 18mer, returning matches sorted by score. Is BLAT a good tool for this task, or should I use a different approach? If I use BLAT, what are good options to give it to search for 18mers? Apologies in advance if this is a dumb question.

blat fasta search database • 4.9k views
3
Entering edit mode
9.5 years ago
Neilfws 49k

The Blat FAQ has lots of useful information for running Blat. The section Using Blat for short sequences with maximum sensitivity recommends, amongst other things:

• The formula to find the shortest query size that will guarantee a match (if matching tiles are not marked as overused) is: 2 * stepSize + tileSize - 1
• Try using -fine
• Use a large value for repMatch (e.g. -repMatch = 1000000) to reduce the chance of a tile being marked as over-used

So I grabbed a FASTA file of chromosome 1, copy/pasted a chunk of it and made some 18-mers:

>1
GACTTCTTGAATGTATTG
>2
TTAGCCATTTAACCACAC
>3
TGTGTTGTTTCTCATTCT
>4
ACCTGTAGAATCTCAAAG
>5
TTCTTTCCCACTTCTATA


And ran Blat a few times with parameters such as:

blat chr1.fa 18mer.fa -stepSize=6 -tileSize=7 -fine -repMatch=1000000 out.psl
blat chr1.fa 18mer.fa -stepSize=2 -tileSize=15 -fine -repMatch=1000000 out.psl


No matches were found. However, matches were found by adding:

-minIdentity=0 -minScore=0


So - you can get results using Blat, but it is not the most appropriate tool to use.

2
Entering edit mode
9.5 years ago
Asaf 8.9k

I would say that BLAT is not the right tool. BLAT is made for searching nearly exact matches and uses a seed that has to be 100% similar (Bowtie also uses the seed), if the two sequences don't have a long enough matching region you won't get any result. The tool for the mission is the basic Smith-Waterman local alignment algorithm implemented in "water" of the EMBOSS package or in the NCBI BLAST website. However, this will only give you the best hit. Another tool that might be used is MAST from the MEME suite. You will have to generate a relaxed PSSM that represents the short sequence, allowing for mismatches. You can do it manually or manipulate MEME to generate such a motif by giving the sequence several times as input along with poly A-C-G-T sequences in the same length, limit the size of the motif to the size of the input sequence and choose oops mode - exactly one match per sequence. MAST will then search for the relaxed sequence motif in the long sequence.

0
Entering edit mode

Thanks for the pointers — I actually do want to look for mismatches, as well. I'll probably set up a local BLAST+ environment, putting mismatches and indels into output categories, and see how that goes.

0
Entering edit mode

BLAST uses seed matching as well to save time, the BLAST server has an implementation of Needelman-Wunch algorithm for global alignment that will give pretty much the same results. If you do use BLAST set the seed as small as possible

0
Entering edit mode

Is there a way to enforce a minimum number of mismatches in BLAST, instead of searching for exact matches? Also, blastn-short is not working too well with shorter sequences, even with adjustments to the E-value and word length (seed) - is BLAST not ideal for shorter sequences?

0
Entering edit mode

You should really use a short-read mapper. Such mappers are designed specifically for your type of data. Blast and blat are not the right tools for this task.

2
Entering edit mode
9.5 years ago

For short RNAs I would recommend segemehl.

It has two pros for small RNAs:

1. It allows not only mismatches, but also indels in the seed (by default)
2. It finds all possible equally good hits within the reference sequence (by default a paramter is set to an upper bound of 100 hits, before the read will be omitted)

Point 1 is important, since small ncRNAs tend to occur in several copies (e.g. hsa-mir-16-1 and hsa-mir-16-2 have the same sequence, but are encoded on two different chromosomes). If you do not allow multiple mappings and discard these reads, you will end up with misleading results.

And bowtie actually has some problems with short RNAseq data and multiple hits. I would recommend to read Attention: Bowtie2 And Multiple Hits.

So, if your sequences can occur several times, or tend to have mistakes, I would recommend to set the correct parameters in bowtie (as explained in the mentioned thread), or use another tool.

0
Entering edit mode

Firstly, your old thread is about bowtie2, not bowtie. Bowtie2 is probably not the right tool anyway for 18bp reads (it is designed for 454 and longer Illumina reads) and we do not know how bowtie behaves, yet. Secondly, I guess Alex wants to see all hits. In that case, it does not matter whether bowtie has a problem with reporting a random hit. Thirdly, if Alex is working with a mammalian genome, most hits with 2+ mismatches/indels are likely to be random hits. We do not need an ultrasensitive mapper.

0
Entering edit mode

You're absolutely right. But that's what I actually said in my last sentence, no? ;) I'm not sure, what exactly Alex is up to. If he wants to see, how stable (unique) an 18mer is, I would say, that allowing 2+ errors is a good way to check that. Most mapping algorithms don't allow indels in their seeds, which is a big problem, when your sequence length is close to the seed length. You will only get perfect hits, or miss the mapping completely.

0
Entering edit mode

After you strongly recommended segemehl several times, I decided to give it a try. I think it is not the right tool for this task. I am not sure how to configure it the best way, but at least with the default setting, segemehl only reports the optimal hits, misses some optimal ones and is memory hungry and inconvenient to use. For very short reads, bwa and gem are better by far. Segemehl may be good for >100bp reads, but even so, the inconvenience may push away a lot of potential users.

0
Entering edit mode

Ok, I think I got your point. ;)

0
Entering edit mode

lh3, you said, that segemehl misses some optimal hits. Since I use this tool for most of my analyses, I want to isolate the problem in order to see where the problem is. I spend some time searching for these hits (by comparing the mappings with these from other tools), but I did not find such an example. Can you send me one or two of these reads, so that I can check them? Thanks a lot!

0
Entering edit mode
0
Entering edit mode

Ah ok, I haven't made the connection with the other post. I think you're talking about the second read in your list. segemehl calls only 30 with 2 mismatches and bowtie 50. I'll check where the problem is. But you should really write down the used parameters... otherwise it's a bit like comparing apples with oranges, no? The tools you know good get better results, because you know how to set the parameters correctly. Anyway... I think it is a great idea to open a thread comparing mapping algorithms... especially with short reads! :) And now I stop the discussion here and switch to your post... :)

2
Entering edit mode
9.5 years ago

BLAT tends to "oversplice" alignments, such that if there is a way to arrive at an otherwise perfect alignment with a 200bp gap between canonical splice sites, it will prefer that over something with an outright internal mismatch.

You could use Vmatch for this. Vmatch employs suffix trees, and is non-heuristic, meaning it is guaranteed to find all hits of a certain threshold. It also has more edit and hamming distance parameters than typical NGS aligners.

0
Entering edit mode

For 18-35bp sequences, many short-read aligners are non-heuristic. Eland, RMAP, MOM and SOAP2 among many others guarantee to find up to 2-mismatch hits. BWA guarantees to find hits within certain edit-distance. For longer sequences with more allowed distance, these non-heuristic algorithms are inefficient, but even in that case, there is GEM which guarantees to find all hits within certain hamming distance at least for ~100bp reads. As to vmatch, I am actually not sure if it is non-heuristic for non-exact matches. What worries me is that it has this blast-like X drop off parameter. Its manual and website do not state it guarantees to find all non-exact hits, either. Nonetheless, it is possible that X drop off is not used for short reads or I may have missed something important in the documentations. In that case, I apologize and will correct myself later.

1
Entering edit mode
9.5 years ago

I think bowtie (version 1) might work well for this. I know that some of my colleagues are using bowtie to find micro RNAs that are just a few bases longer than your targets. Limitations are that is really only suited to find exact matches or up to a few mismatches.

0
Entering edit mode

Thanks. Is there a way to rank output from bowtie?

1
Entering edit mode

you get the CIGAR string and MD tags in BAM. Alas the mapping quality that it produces is probably not that useable (bowtie2 might also work just that I never tried). Perhaps someone else knows of a better approach.