Javascript: Searching A Pwm-Motif In A Dna Sequence
1
1
Entering edit mode
10.3 years ago

I'm trying to implement a javascript-based search of PWM in a DNA sequence. I've been inspired by the following post: http://davetang.org/muse/2013/10/01/position-weight-matrix/ and by the R/Biostrings-code https://github.com/genome-vendor/r-bioc-biostrings/blob/master/src/match_PWM.c. See my code below.

I'm starting from a PFM, and I convert it to PWM using the algorithm described in http://davetang.org/muse/2013/10/01/position-weight-matrix/. I get the same results:

PWM input

A 0 4 4 0 3 7 4 3 5 4 2 0 0 4
C 3 0 4 8 0 0 0 3 0 0 0 0 2 4
G 2 3 0 0 0 0 0 0 1 0 6 8 5 0
T 3 1 0 0 5 1 4 2 2 4 0 0 1 0

PWM output

A -1.936751795439824 0.7980887856393778 0.7980887856393778 -1.936751795439824 0.4535418763586175 1.5094375840527072 0.7980887856393778 0.4535418763586175 1.0760077609593484 0.7980887856393778 0 -1.936751795439824 -1.936751795439824 0.7980887856393778
C 0.4535418763586175 -1.936751795439824 0.7980887856393778 1.6854416207624505 -1.936751795439824 -1.936751795439824 -1.936751795439824 0.4535418763586175 -1.936751795439824 -1.936751795439824 -1.936751795439824 -1.936751795439824 0 0.7980887856393778
G 0 0.4535418763586175 -1.936751795439824 -1.936751795439824 -1.936751795439824 -1.936751795439824 -1.936751795439824 -1.936751795439824 -0.665198492276212 -1.936751795439824 1.308938775371164 1.6854416207624505 1.0760077609593484 -1.936751795439824
T 0.4535418763586175 -0.665198492276212 -1.936751795439824 -1.936751795439824 1.0760077609593484 -0.665198492276212 0.7980887856393778 0 0 0.7980887856393778 -1.936751795439824 -1.936751795439824 -0.665198492276212 -1.936751795439824

Searching the motif in the DNA sequence : I'm not confident about my code.

Score("GAAATTACAAGGG"): null
Score("ttacataagtagtc"): null
Score("ttacataACCagtc"): null
Score("CAACTAAAAAGGGC"): 2.692960768092946

My main problem is about the min_score https://github.com/genome-vendor/r-bioc-biostrings/blob/master/src/match_PWM.c. As far as I understand this score should is a fraction 0<=min_score<1 . But (am I wrong ?), there is no normalization here on the X-axis. So is my code correct and if no, how should I fix it?

Thank you,
Pierre

motif • 3.9k views
ADD COMMENT
0
Entering edit mode

The script that is mentioned appears to be available here

ADD REPLY
4
Entering edit mode
10.3 years ago

The code looks very fine.

Since PWM values are log-odds, a final score greater than 0 means that a query motif is more similar to the sequence motifs that were used to build the matrix, rather than to any random (unrelated) motif. In contrast, negative scores indicate that query sequence is more similar to the background model than to the PWM profile. That's why, in general, you can just set the min_score parameter as 0 since you are looking for similar sequences.

However, score by itself doesn't tell much. You might also want to provide a direct estimate (e.g. p-value, e-value) of the significance of any score calculated from your PWM. This can be done very simply by randomization-based test, for example:

  1. Generate 1000 random DNA motifs respecting real frequencies of different nucleotides.
  2. Calculate scores of these motifs without setting min_score parameter (save negative scores too).
  3. Sort the distribution of scores.
  4. If you assume 0.05 as a confidence level, then last 50 score values from your sorted list are treated as statistically significant.
  5. Then, min_score might be set to the value of 950 element from the list.
ADD COMMENT

Login before adding your answer.

Traffic: 1849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6