Question

Calculating Z-Scores

4

12.9 years ago

Diana ▴ 930

Hello everyone,

I have a set of sequences in which I am identifying over-represented transcription factor binding sites. I've used Patser to identify all possible matrices using Transfac and Jaspar libraries. I'm trying to calculate z-scores for my matrices. Are there any functions to do that in Perl or R?

Thanks!!

transcription binding enrichment • 7.2k views

ADD COMMENT • link updated 12.6 years ago by Larry_Parnell 16k • written 12.9 years ago by Diana ▴ 930

0

A z-score implies a normal distribution. Check out "DESeq" for this task instead.

ADD REPLY • link 12.9 years ago by Karl ▴ 350

score 7 · Answer 1 · 2011-12-13

7

12.9 years ago

Gjain 5.8k

Hi Diana,

you can follow this procedure:

# PFM from JASPAR = input file
A   16  352 3   354 268 360
C   46  0   10  0   0   3
G   18  2   2   5   0   20
T   309 35  374 30  121 6

# INPUT kmers
TTGGGG
TATATA
TATAAA
TAAATA

# To convert PFM to PWM
w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
where
    w - is a weight for the current nucleotide we are calculating
    f - is the number of occurrences of the current nucleotide in the current column (e.g., "16" for A in column 1, "46" for C etc.)
    N - total number of observations, the sum of all nucleotide occurrences in a column (16+46+18+309=389 in this example)
    p - [prior] [background] frequency of the current nucleotide; this one usually defaults to 0.25 (i.e. one nucleotide out of four)

# PWM we get:
A   -0.43   1.11    -0.27   1.10    1.46    1.09    
C   -0.83   -0.21   -0.36   -0.21   -0.21   -0.23   
G   -0.42   -0.22   -0.26   -0.25   -0.21   -0.35   
T   1.54    -0.44   1.09    -0.41   -1.53   -0.25
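
The conversion formula above can be sketched in Python roughly as follows. This is only an illustration, assuming a uniform background of 0.25 and the sqrt(N) pseudocount from the formula; the function name pfm_to_pwm is made up for this example. The resulting numbers depend on the pseudocount and background chosen, so they are not guaranteed to reproduce the PWM table above.

```python
import math

# PFM copied from the post (rows A, C, G, T; six columns).
PFM = {
    "A": [16, 352, 3, 354, 268, 360],
    "C": [46, 0, 10, 0, 0, 3],
    "G": [18, 2, 2, 5, 0, 20],
    "T": [309, 35, 374, 30, 121, 6],
}


def pfm_to_pwm(pfm, background=0.25):
    """Apply w = log2((f + sqrt(N) * p) / (N + sqrt(N)) / p) to every cell.

    Hypothetical helper: assumes the same background p for all four bases.
    """
    width = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for col in range(width):
        n = sum(pfm[base][col] for base in pfm)  # N: total count in this column
        pseudo = math.sqrt(n)                    # pseudocount weight sqrt(N)
        for base in pfm:
            f = pfm[base][col]
            freq = (f + pseudo * background) / (n + pseudo)
            pwm[base].append(math.log2(freq / background))
    return pwm


pwm = pfm_to_pwm(PFM)
for base, weights in pwm.items():
    print(base, "  ".join(f"{w:6.2f}" for w in weights))
```

The pseudocount keeps cells with a count of 0 from producing log2(0).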

# To calculate z-score
z = (x - mean)/sd
    The variables in the z-score formula are:
    z = z-score
    x = raw score or observation to be standardized
    mean = mean of the population
    sd = standard deviation of the population

For example:

kmer:TATAAA
raw score: 1.54+1.11+1.09+1.1+1.46+1.09 = 7.39
$zscore = ($raw_score - $mean)/$std_dev;

You can calculate $mean and $std_dev from your raw scores and then get the z-scores.
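
As a rough sketch of the whole procedure in Python (not the only way to do it), the snippet below scores the four input k-mers against the PWM from the post and standardizes the raw scores. It assumes that the "population" is just these four k-mers and therefore uses the population standard deviation; with a different reference set the z-scores will change. The helper names raw_score and z_scores are made up for this example.

```python
from statistics import mean, pstdev

# PWM values copied from the post (rows A, C, G, T; six columns).
PWM = {
    "A": [-0.43, 1.11, -0.27, 1.10, 1.46, 1.09],
    "C": [-0.83, -0.21, -0.36, -0.21, -0.21, -0.23],
    "G": [-0.42, -0.22, -0.26, -0.25, -0.21, -0.35],
    "T": [1.54, -0.44, 1.09, -0.41, -1.53, -0.25],
}
KMERS = ["TTGGGG", "TATATA", "TATAAA", "TAAATA"]


def raw_score(kmer, pwm):
    """Sum the weight of each base at its position in the k-mer."""
    return sum(pwm[base][pos] for pos, base in enumerate(kmer))


def z_scores(values):
    """Standardize values as z = (x - mean) / sd.

    Assumption: the values are the whole population, hence pstdev.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all scores are identical; z-score is undefined")
    return [(v - mu) / sd for v in values]


raw = [raw_score(k, PWM) for k in KMERS]
for kmer, x, z in zip(KMERS, raw, z_scores(raw)):
    print(f"{kmer}\traw={x:.2f}\tz={z:.2f}")
```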

hope this helps.

ADD COMMENT • link 12.9 years ago by Gjain 5.8k

1

Also, I want to add that I have 2 output files:

1st file contains the matches for my sequences
2nd file contains the matches for the shuffled sequences (shuffled 1000 times)

The z-score that I have to calculate should be:

z-score = (no. of matches in unshuffled sequences - avg. no. of matches in shuffled sequences) / standard deviation of matches in shuffled sequences

I don't understand how to calculate the standard deviation for my shuffled sequences. Can you help me? Thanks

ADD REPLY • link 12.9 years ago by Diana ▴ 30
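
The shuffle-based z-score described in the comment above could be computed along these lines. This is a minimal sketch: it assumes you have already reduced the second file to one match count per shuffle (1000 numbers for 1000 shuffles), and the counts shown here are made up purely for illustration. The function name shuffle_z_score is hypothetical.

```python
from statistics import mean, stdev


def shuffle_z_score(observed, shuffled_counts):
    """z = (observed - mean of shuffled counts) / sd of shuffled counts.

    Uses the sample standard deviation, since the shuffles are a sample
    of all possible shufflings (an assumption, not stated in the thread).
    """
    if len(shuffled_counts) < 2:
        raise ValueError("need at least two shuffles to estimate the sd")
    mu = mean(shuffled_counts)
    sd = stdev(shuffled_counts)
    if sd == 0:
        raise ValueError("shuffled counts are all identical; z is undefined")
    return (observed - mu) / sd


# Made-up numbers: matches for one motif in the real sequences,
# and the match count for the same motif in each shuffle.
observed_matches = 12
shuffled_matches = [3, 5, 4, 6, 2, 5, 4, 3, 6, 2]

z = shuffle_z_score(observed_matches, shuffled_matches)
print(f"z-score = {z:.2f}")
```

The same function would be called once per motif, each time with that motif's own list of per-shuffle counts.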

1

Hi Diana, you can get the z-scores for all 18 sequences by the method described here: http://wise.cgu.edu/sdtmod/measures2.asp . Please make sure you use the p-values and not the -ln(p-value) column. I hope this is what you are looking for. You can repeat the same analysis for each randomization.

ADD REPLY • link 12.9 years ago by Gjain 5.8k
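
Since the output reports -ln(p-value) rather than the p-value itself, the column has to be converted back before using p-values. A small sketch, assuming the last column really is the negated natural log of p (the helper name p_from_neg_ln is made up, and the values are the ones quoted in the thread):

```python
import math


def p_from_neg_ln(neg_ln_p):
    """Invert x = -ln(p), i.e. p = exp(-x)."""
    return math.exp(-neg_ln_p)


# -ln(p-value) entries as they appear in the example output.
neg_ln_values = [10.25, 11.65, 10.17, 7.92]
for value in neg_ln_values:
    print(f"{value:5.2f} -> p = {p_from_neg_ln(value):.3e}")
```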

0

Thank you Gjain for your answer. My output file looks like this:

Sequence name | Motif ID | Seq co-ord | Motif Score | -ln(p-value)
seq1 | 10002 | 000002 | 8.26 | 10.25
seq1 | 10002 | 000013 | 9.38 | 11.65
seq1 | 10003 | 000594 | 8.06 | 10.17
seq1 | 10003 | 000082 | 7.13 | 7.92

and I have about 18 sequences. This output was obtained after 1000 shuffles. How can I calculate the mean and sd for this, given that I don't have the individual values that make up the raw score? I'm sort of lost.

ADD REPLY • link 12.9 years ago by Diana ▴ 30
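
One possible way to get a mean and sd out of a file like this is to group the rows by motif ID and summarize the Motif Score column. The sketch below assumes whitespace-separated rows with exactly the five columns shown above and uses only the four rows quoted in the thread; whether the score or the number of matches is the right quantity to summarize depends on the z-score you are after. The function name scores_by_motif is made up for this example.

```python
from collections import defaultdict
from statistics import mean, stdev

# The four example rows from the thread, one record per line.
ROWS = """\
seq1 10002 000002 8.26 10.25
seq1 10002 000013 9.38 11.65
seq1 10003 000594 8.06 10.17
seq1 10003 000082 7.13 7.92
"""


def scores_by_motif(text):
    """Collect the Motif Score values for each motif ID."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip headers, blank lines and malformed rows
        _seq, motif_id, _coord, score, _neg_ln_p = fields
        grouped[motif_id].append(float(score))
    return grouped


for motif_id, scores in scores_by_motif(ROWS).items():
    # Sample sd needs at least two values.
    sd = stdev(scores) if len(scores) > 1 else float("nan")
    print(f"{motif_id}\tn={len(scores)}\tmean={mean(scores):.3f}\tsd={sd:.3f}")
```

Counting the rows per motif (the n column) gives the number of matches, which is what the shuffle-based z-score formula needs.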

score 1 · Answer 2 · 2012-03-19

1

12.6 years ago

Larry_Parnell 16k

You asked for a function, but perhaps it is also useful to have a reference. We use the paper "MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data" by Doniger, Conklin, et al., Genome Biology 2003, 4:R7.

ADD COMMENT • link 12.6 years ago by Larry_Parnell 16k