How Can I Apply A Pssm Efficiently?
2
1
Entering edit mode
9.4 years ago
glocke01 ▴ 190

I am fitting for position specific scoring matrices (PSSM aka Position Specific Weight Matrices). The fit I'm using is like simulated annealing, where I the perturb the PSSM, compare the prediction to experiment and accept the change if it improves agreement. This means I apply the PSSM millions of times per fit; performance is critical. In my particular problem, I'm applying a PSWM for an object of length L (~8 bp) at every position of a DNA sequence of length M (~30 bp) (so there are M-L+1 valid positions). (Tags apparently don't permit the '+' symbol, but I'm writing in C++ not C.)

My best idea is to convert the DNA into some kind of a matrix so that applying the PSWM is matrix multiplication. There are efficient linear algebra libraries out there (e.g. BLAS), but I'm not sure how best to turn an M-length DNA sequence into a matrix M x 4 matrix and then apply the PSSM at each position. The solution needs to work for higher order/dinucleotide terms in the PSSM - presumably this means representing the sequence-matrix for mono-nucleotides and separately for dinucleotides.

My current solution iterates over each position m, then over each letter in word from m to m+L-1, adding the corresponding term in the matrix. I'm storing the matrix as a multi-dimensional STL vector, and profiling has revealed that a lot of the computation time is just accessing the elements of the PSSM (with similar performance bottlenecks accessing the DNA sequence).

If someone has an idea besides matrix multiplication, I'm all ears.

c pssm • 3.4k views
1
Entering edit mode
9.3 years ago
glocke01 ▴ 190

This paper discusses the issue of representing DNA as a matrix, and considers PSSM applications. It would seem that representing the DNA in some such format as a matrix should allow one to represent the PSSM in a compatible matrix will make matrix multiplication feasible within BLAS or some such.

Edit 2-22-2016

I believe the following will be the most efficient algorithm to do this with linear algebra assuming we care only about speed of execution, not memory or time to construct the data structures (these assumptions hold for my original application, where the sequences didn't change but the PSSM did). It's so simple I can't imagine why I didn't think of it before now. I don't have an implementation, but it should be straightforward to write up.

Flatten the PSSM into column vector p of length 4*L, with values [A_1, C_1, G_1, T_1, A_2,...]. The DNA sequence at a binding site can be represented as a row-vector s of length 4*L with indicator variables [a_1, c_1, g_1, t_1, a_2,...]. x_i is 1 if the ith base in this site is nucleotide x; e.g. "GAC" would be [0,0,1,0, 1,0,0,0, 0,1,0,0]. So the (inner) product sp is the score of this position.

So, if we represent a large DNA sequence as a matrix S with 4*L columns and M-L+1 rows. Each row of S represents one position, so if your long sequence is "GACTAAC" and L=4, S will look like ["GACT", "ACTA", "CTAA", "TAAC"].

Thus, the product Sp will be a column vector where the i'th element is the score at the i'th position in the sequence.

Now this is a simple matrix*vector calculation that is trivially handled by your preferred linear algebra package.

1
Entering edit mode
9.4 years ago
RossCampbell ▴ 140

You might find the MOODS algorithms helpful. It's written in C++, but it also has Perl and Python interfaces if you are so inclined.

1
Entering edit mode

Thanks, but their algorithm for applying the PSSM is basically the same as mine (though the task to which I'm applying the PSSM's is different from what MOODS is aimed at). The sequence and score matrices are both represented as STL vectors in this code, but this is where I'm starting, what I'm trying to improve on.

0
Entering edit mode

Well, if you're able to find a better algorithm somewhere, can you link it here? I'd be interested in it too.