Question: Odd max query limit in BLAT
0
gravatar for juan.crescente
15 months ago by
juan.crescente20 wrote:

I'm running BLAT with this command:

blat genome.fasta query.fasta blat_90.csv -minIdentity=90 -noHead

And I'm surprised that I've counted the results and order by count.

import pandas as pd
import sys
df_blat = pd.read_csv('blat_90.csv', index_col=False, sep='\t')
s = df_blat.qName.value_counts()
count = 1
for k in s.head(100).index.tolist():
    count += 1
    print(s[k],k)

The interesting thing is that the max results per query is 672 and this number is repeated about 30 times. Then I have queries that were repeated 671 times again many times.

672 MITE_T_71978|chr3D|340455280|340455443|AT|175|F2975
672 MITE_T_97305|chr7B|194283260|194283371|AT|29|F4298
672 MITE_T_110543|chr5D|34445518|34445608|TA|94|F5023
672 MITE_T_72475|chr3D|503092467|503092630|AT|174|F2988
...

So this makes me wonder if there's a limit in the number of times a query is search. I've repeated this script with no different results.

UPDATE: I've notice that running only one query with each wheat chromosome in a separate thread, the max results per search is 32.

This is how the results look like when I search only one sequence agains each chromosome at the time:

 wc -l b*
           32 b1A.csv
           32 b1B.csv
           32 b1D.csv
           32 b2A.csv
           32 b2B.csv
...

Max 32 hits per search (accordingly 32 x 21 chromosomes is 672)

There must be a limit in the code but I'm still not able to find it

blat • 501 views
ADD COMMENTlink modified 15 months ago • written 15 months ago by juan.crescente20

Considering Jim Kent is author of blat (and how long blat has been around) it seems very unlikely that you are running into some odd bug in blat. Do you get a different number if you increase the tile size (-tileSize=20)?

Have you tried taking a sequence from human genome and testing it against that? I will try it out later today.

ADD REPLYlink written 15 months ago by genomax70k

default tileSize of 11 is what I need in this case, I'll give a try (I do not understand the reason, and max is 18). Why a sequence from human? I'm using wheat genome. Also I did not know who Jim Kent is till now, but I guess any programmer could make mistakes.

ADD REPLYlink modified 15 months ago • written 15 months ago by juan.crescente20

Human was a test. Just something other than wheat in case that genome has some strange replicative characteristic you were picking up in this search.

ADD REPLYlink written 15 months ago by genomax70k

changing the tileSize to 17 gives 3.240.017 results vs the 32 in the original command.

ADD REPLYlink written 15 months ago by juan.crescente20

So 672 is not a hard limit in blat. You are observing some characteristic in your data. I am not sure what kind of data this is.

ADD REPLYlink written 15 months ago by genomax70k

Indeed. Oddly a larger tileSize gives more result that default. It is the wheat genome, a large and repetitive one.

ADD REPLYlink written 15 months ago by juan.crescente20

I was just thinking that too. Increasing the tile size should have given fewer results. This result would indicate that the increased tile size is allowing more initial seed alignments?

ADD REPLYlink written 15 months ago by genomax70k

Could be, sounds like an undocumented feature (or an issue) since 3M results vs just 32 does not makes sense.

ADD REPLYlink written 15 months ago by juan.crescente20

I've realized that specifing even default value (11) gives results as excected. Thanks for the advise.

ADD REPLYlink written 15 months ago by juan.crescente20

@Juan: Is there still a question that is outstanding? I assume it is "the max results per search is 32".

Are you using the latest blat available? Did you get the precompiled executables or compiled it yourself?

ADD REPLYlink written 15 months ago by genomax70k

I've used the latest and compile it myself

https://users.soe.ucsc.edu/~kent/src/blatSrc35.zip

Indeed I've realized that specifying the tileSize does not solve the issue

Also tried with

http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.36.zip

nohup ./blat /media/crescentejuan/Data/iwgsc_by_chr/chr1A.fasta  ../blat/query.fa /media/crescentejuan/Data/blat_built_demo.csv  -minIdentity=95 -noHead -tileSize=11  &

and also got:

wc -l blat_built_demo.csv
32 blat_built_demo.csv
ADD REPLYlink modified 15 months ago • written 15 months ago by juan.crescente20

I am going to suggest that you either directly write to Jim Kent (you should be able to find his email at UCSC and as an aside you should know who he is) or post to the UCSC genome browser list about this problem.

ADD REPLYlink written 15 months ago by genomax70k
2
gravatar for juan.crescente
15 months ago by
juan.crescente20 wrote:

As responded by UCSC itself:

The 32 results are indeed a consistent limitation where BLAT only allows 16 results per chromosome and per strand. BLAT is unable to exhaustively return results for queries with repetitive sequences that map to many places in the genome. There may be many other sequence aligning programs available that do not have such limits, but we do not have any specific recommendations about which to use.

ADD COMMENTlink modified 15 months ago • written 15 months ago by juan.crescente20

Thank you for this. Glad there is a logical explanation.

ADD REPLYlink written 15 months ago by genomax70k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1671 users visited in the last hour