Question

How To Scale A Pssm For Ncbi Blast

5

Entering edit mode

13.8 years ago

Michael Schubert ★ 7.1k

Hi again,

I'm trying to do a local NCBI BLAST search using the PSSM of a conserved CDD domain (.smp file in the database, cave LARGE file).

My database is containing nucleotide sequences and the PSSM is for proteins, so I would like to use tblastn for this search. The tblastn executable accepts PSSMs, so no problem there.

But: the matrix in the .smp file is scaled by a factor of 100; according to the specs, the BLAST program should be able to downscale the matrices back to a factor of 1 automatically.

If I'm running tblastn however, I get the following error:

BLAST engine error: PSSM has a scaling factor of 100. PSI-BLAST does not accept scaled PSSMs

Now, is there either a way to run tblastn with scaled matrices or a tool to scale them? And if not, do I just scale the matrix values by 1/100 or the lambda, kappa and h factors (whatever these are) as well?

blast pssm • 5.6k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.8 years ago by Michael Schubert ★ 7.1k

Ram · Answer 1 · 2010-07-24

2

Entering edit mode

13.7 years ago

Lars Juhl Jensen 11k

I am not aware of any tool that can do what you want to do. This is what I would do to eliminate the scaling:

Change the scaling factor from 100 to 1
Divide all weights in the PSSM by 100
Possibly multiply lambda by 100
Keep kappa and h unchanged

The part about the scaling factor and the weights should be self-explanatory (and you indeed suggested to do the same yourself).

The reason for possibly having to scale lambda is that it is used in the Karlin-Altschul formula: E = kappa*N*exp(-lambda*S). The big question is if BLAST corrects S for the scaling factor before plugging it into this formula. If it does, you should not change lambda. If it does not, you need to multiply lambda by 100 when dividing all the weights in the PSSM by 100. I sadly do not know the finer details of BLAST well enough to know which is the case.

The way kappa is used in the formula means that it should not be affected by the scaling factor. The parameter H is the entropy, which is used for calculating the effective length of query sequences and sequences in the database. The formula used is effective_length = max(length-lambda*S/H, 1). Consequently, H should also not be effected by the scaling (since lambda*S should be unchanged).

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.7 years ago by Lars Juhl Jensen 11k

0

Entering edit mode

Thanks for the answer! I have some things to add I found out myself meanwhile: lambda, kappa, and h values are stored as triplets of numbers in the NCBI smp files, eg. lambda { 267, 10, -3 }. I think that means a value of 267 * 10^-3 = 0.267, being saved in this fashion because the format only supports integers as values. Thus dividing the weights by 100 will result in a loss of accurary, which in my case will not matter much because I use this search only as a pre-filtering step.

ADD REPLY • link 13.7 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

Yes, no surprise that dividing the weights by 100 will result in a loss of accuracy. As far as I understand, the whole purpose of initially scaling them up by a factor of 100 was to increase accuracy. Great that you figured out what the triplets mean - it was not clear to me at all :-)

ADD REPLY • link 13.7 years ago by Lars Juhl Jensen 11k

Ram · Answer 2 · 2010-07-25

Here is a little Python script for scaling a matrix with factor 100 down to 1. It does, however, only perform string operations and thus not round values. Lambda is also kept as is.

The Results are reasonable so far, the effect of the rounding error seems to be a minor one (with Lambda*100 results are weird, so I guess this is the right approach).

Usage: script infile outfile

#!/usr/bin/env python

import sys, re
f = open(sys.argv[1], 'r')
content = f.read()
f.close()

start = content.find("scores {")
end = content.find("}", start)
scores = re.sub("([0-9]{2})(?=[\r\n\,])", "", content[start:end])
scores = re.sub("\ (\-)?,", " 0,", scores)

f = open(sys.argv[2], 'w')
f.write(content[0:start] + scores + 
    content[end:].replace("scalingFactor 100", "scalingFactor 1"))
f.close()