Question: What is happening in the Nussinov Algorithm?
0
M.O.L.S10 wrote:

I am trying to understand the Nussinov Algorithm, and I have a few questions: maybe more.

What are i and j. Are they just called this because there are paired with each other?

What is K and K + 1 in the bifurcation?

What is happening during initialisation?

From what I can see, one just puts 0s in a matrix , but why is this done? And why is half of the diagonal matrix not completed at all ?

Also, what is happening during recursion?

Please do not explain in pseudocode =).

nussinov algorithm • 339 views
modified 9 months ago • written 10 months ago by M.O.L.S10
4
cschu1812.5k wrote:

What are i and j. Are they just called this because there are paired with each other?

i and j are two bases. They might be called a and b, b_i and b_j or bugs and bunny. i and j are only assumed to be paired with each in case 1 of the scoring function.

What is K and K + 1 in the bifurcation?

The bifurcation is needed to deal to model the case when bases i and j belong to two distinct structure motifs (loops), e.g. two hairpins. More concrete: motif 1 = S[i..k] and motif 2 = S[k'..j] with k' = k+1..j-1. This will then give the maximum number of basepairs as the sum of basepairs in motif 1 and motif 2.

What is happening during initialisation? From what I can see, one just puts 0s in a matrix , but why is this done? And why is half of the diagonal matrix not completed at all ?

Given an RNA sequence S, the cell N[i, j] in the Nussinov matrix gives you the maximum number of possible base pairs in the subsequence S[i..j].

Think about what this means for the matrix:

1. Each pair of bases (i, j) represents a sequence of length j - i + 1. The main diagonal is for sequences of length 1 (S[i..i]) and the one below it for sequences of length 0. Since sequences cannot have negative lengths, the bottom part of the matrix is left unfilled.
2. There is no base pair in sequences of length 1 and 0, hence N[i, i] = 0 (length 1) and N[i, i-1] = 0, for i in {1..length(S)}. These are the starting cases (hence "Initialization")

Also, what is happening during recursion?

During recursion you go through each subsequence S[i..j] and calculate the maximum number of base pairs possible for that sequence. This is done based on maximal numbers of base pairs that you have calculated (or initialized to 0) in the preceeding steps and the initialization phase.

0
M.O.L.S10 wrote:

This is a reply, not an answer... I understand the use of j-i +1. Without this the initialisation doesn't make sense. I appreciate your help, Thank you, although it has brought about a few more questions...

So I understand that i am comparing 1 RNA sequence with itself. If I was doing an alignment I would be inserting gaps and such, so I'm not doing that in this case. The RNA sequence can also be written on the top and left hand side of the matrix, along with the axes numbers and the coordinates for each cell in the matrix. N[i1, j1] excetera.

Can I assume that a sequence of length 0 consists of no ribonucleotides and a sequence of length 1 consists of a single ribonucleotide? Then how do i know what a sequence of length 3 or 4 looks like? and how do I know if the sequence of length 3 or 4 is base paired or not and where the base pairs are?

After initialisation, is it okay to start with sequences of length 2, and see if they base pair or not? Is this the minimum length I can start with, or should i make sure that it is of length 3, so that there is a guaranted base pair somewhere?

Is the output of this recursion the structure of the base paired RNA in dot bracket form or is it the structure of the base paired RNA in ribonucleotide letters?

During recursion, why should I traverse through and fill up the matrix in rows of negative diagonals? Is there a special reason for this?

What is case one of the scoring function?

Also, what am I doing with the Nussinov Algorithm? Am I looking for the structure that has the most base pairs? Am I looking for the structure with the most-likely number of base pairs(not the most, not the least)?

If the scoring functions are:

score i, j-1

score( i+1, j-1) + different_score (i, j)

score (i + 1, j)

Is there a particular one that should be used first, second and/or third, or does it not matter?

Maybe these aren't scoring functions. If not, then the scoring function is put a 1 if there is a base pair that can occur and a 0 if a base pair cannot occur.

Can I assume that a sequence of length 0 consists of no ribonucleotides and a sequence of length 1 consists of a single ribonucleotide?

Yes.

Then how do i know what a sequence of length 3 or 4 looks like? and how do I know if the sequence of length 3 or 4 is base paired or not and where the base pairs are?

In cell (i,j) you’re looking at the maximum number of basepairs that can occur in subsequence S[i..j]. At that point you’re determining if bases i and j forming a pair is better than any of the other possibilities (i unpaired, j unpaired, bifurcation). For a sequence of length N, you’re looking at a potential basepair between bases 1 and N, but all you know at that point (during filling) is how many other basepairs are in the structure, not where they are. Those details don’t matter to the algorithm at that point.

After initialisation, is it okay to start with sequences of length 2, and see if they base pair or not? Is this the minimum length I can start with, or should i make sure that it is of length 3, so that there is a guaranted base pair somewhere?

During recursion, why should I traverse through and fill up the matrix in rows of negative diagonals? Is there a special reason for this?

You’re doing sequences of length 2, 3, 4, …, N. That is why you fill the diagonals in that particular manner, i.e. each diagonal represents sequences of a certain length. (Remember, main diagonal: length 1). Concerning “guaranteed” basepairs: by default, the Nussinov algorithm will model basepairs even between adjacent bases, i.e. in sequences of length 2. A common homework assignment is to modify the algorithm so that it does obey a minimum hairpin loop size of 3 bases.

Is the output of this recursion the structure of the base paired RNA in dot bracket form or is it the structure of the base paired RNA in ribonucleotide letters?

The output of the “fill”-recursion is the filled matrix containing the maximum numbers of base pairs. After traceback recursion, you will have a list of pairs (i, j), which you then can use to visualise the structure however you like. What is case one of the scoring function?

In my book that is the (i + 1, j), which would be: what score does an unpaired base i contribute to the structure.

Also, what am I doing with the Nussinov Algorithm? Am I looking for the structure that has the most base pairs? Am I looking for the structure with the most-likely number of base pairs(not the most, not the least)?

You’re looking for the maximum number of basepairs that a certain sequence can -- theoretically -- contain according to a simple scoring model.

If the scoring functions are: score i, j-1 score( i+1, j-1) + different_score (i, j) score (i + 1, j) Is there a particular one that should be used first, second and/or third, or does it not matter?

During fill, you’re taking the max anyway. During traceback, it’s your choice.

Maybe these aren't scoring functions. If not, then the scoring function is put a 1 if there is a base pair that can occur and a 0 if a base pair cannot occur.

It is “kind” of a scoring function, just a very simple one.