Calculating Z-Scores
11.5 years ago
Diana ▴ 900

Hello everyone,

I have a set of sequences in which I am identifying over-represented transcription factor binding sites. I've used Patser to identify all possible matrices using Transfac and Jaspar libraries. I'm trying to calculate z-scores for my matrices. Are there any functions to do that in Perl or R?

Thanks!!

transcription binding enrichment
11.5 years ago
Gjain 5.7k

Hi Diana,

# PFM from JASPAR = input file
A   16  352 3   354 268 360
C   46  0   10  0   0   3
G   18  2   2   5   0   20
T   309 35  374 30  121 6

# INPUT kmers
TTGGGG
TATATA
TATAAA
TAAATA

# To convert PFM to PWM
w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
where
w - is a weight for the current nucleotide we are calculating
f - is the number of occurrences of the current nucleotide in the current column (e.g., "16" for A in column 1, "46" for C, etc.)
N - total number of observations, i.e. the sum of all nucleotide occurrences in a column (16+46+18+309=389 for column 1 in this example)
p - [prior] [background] frequency of the current nucleotide; this one usually defaults to 0.25 (i.e. one nucleotide out of four)
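
To make the conversion concrete, here is a minimal Python sketch of the formula above applied to the PFM in this post. Python is used only for illustration; the same arithmetic ports directly to Perl or R. The function name pfm_to_pwm and the uniform background of 0.25 are my assumptions. The weights it prints follow the formula literally with log base 2, so they need not reproduce the rounded matrix quoted in this post.

```python
import math

# PFM copied from the post (rows A, C, G, T; six columns).
PFM = {
    "A": [16, 352, 3, 354, 268, 360],
    "C": [46, 0, 10, 0, 0, 3],
    "G": [18, 2, 2, 5, 0, 20],
    "T": [309, 35, 374, 30, 121, 6],
}

def pfm_to_pwm(pfm, background=0.25):
    """Apply w = log2((f + sqrt(N) * p) / (N + sqrt(N)) / p) to every cell."""
    width = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for col in range(width):
        n = sum(pfm[base][col] for base in pfm)  # N for this column
        pseudo = math.sqrt(n)                    # sqrt(N) pseudocount
        for base in pfm:
            f = pfm[base][col]
            freq = (f + pseudo * background) / (n + pseudo)
            pwm[base].append(math.log2(freq / background))
    return pwm

pwm = pfm_to_pwm(PFM)
for base in "ACGT":
    print(base, "  ".join(f"{w:6.2f}" for w in pwm[base]))
```

The sqrt(N) pseudocount keeps cells with a count of 0 from producing log(0).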

# PWM we get:
A   -0.43   1.11    -0.27   1.10    1.46    1.09
C   -0.83   -0.21   -0.36   -0.21   -0.21   -0.23
G   -0.42   -0.22   -0.26   -0.25   -0.21   -0.35
T   1.54    -0.44   1.09    -0.41   -1.53   -0.25

# To calculate z-score
z = (x - mean)/sd
The variables in the z-score formula are:
z = z-score
x = raw score or observation to be standardized
mean = mean of the population
sd = standard deviation of the population


For example:

kmer:TATAAA
raw score: 1.54+1.11+1.09+1.1+1.46+1.09 = 7.39
$zscore = ($raw_score - $mean)/$std_dev;


You can calculate $mean and $std_dev and get the z-scores.
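
As a rough sketch of those steps in Python (again only for illustration; it translates line by line to Perl or R), the following scores the four input k-mers against the PWM above and standardizes the raw scores. Which population the mean and sd come from is not stated in this answer; here I assume, purely for the example, that it is the set of input k-mers themselves. The names raw_score and z_scores are hypothetical.

```python
import statistics

# PWM values copied from the post (rows A, C, G, T).
PWM = {
    "A": [-0.43, 1.11, -0.27, 1.10, 1.46, 1.09],
    "C": [-0.83, -0.21, -0.36, -0.21, -0.21, -0.23],
    "G": [-0.42, -0.22, -0.26, -0.25, -0.21, -0.35],
    "T": [1.54, -0.44, 1.09, -0.41, -1.53, -0.25],
}
KMERS = ["TTGGGG", "TATATA", "TATAAA", "TAAATA"]

def raw_score(kmer, pwm):
    """Sum the PWM weight of each base at its position in the k-mer."""
    return sum(pwm[base][i] for i, base in enumerate(kmer))

def z_scores(values):
    """Standardize values against their own mean and population sd.

    Assumption: the values themselves are the whole population.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

scores = [raw_score(k, PWM) for k in KMERS]
for kmer, score, z in zip(KMERS, scores, z_scores(scores)):
    print(f"{kmer}  raw={score:.2f}  z={z:+.2f}")
```

In a real analysis the mean and sd would more likely come from a background set, for example scores of all k-mers in shuffled sequences, rather than from four k-mers.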

hope this helps.

Also, I want to add that I have 2 output files:

1st file contains the matches for my sequences
2nd file contains the matches for the shuffled sequences (shuffled 1000 times)

The z-score that I have to calculate should be:

z-score = (no. of matches in unshuffled sequences - avg. no. of matches in shuffled sequences) / standard deviation in shuffled sequences

I don't understand how to calculate the standard deviation for my shuffled sequences. Can you help me? Thanks
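
For what it's worth, that formula can be sketched in a few lines of Python (illustration only; the same arithmetic works in Perl or R). It assumes you have one match count per shuffle, i.e. a list of 1000 numbers, and the standard deviation is simply taken over that list. The function name enrichment_z and the example counts are made up.

```python
import statistics

def enrichment_z(observed, shuffled_counts):
    """z = (observed - mean of shuffled counts) / sd of shuffled counts."""
    mean = statistics.mean(shuffled_counts)
    sd = statistics.stdev(shuffled_counts)  # sample sd, divides by n - 1
    if sd == 0:
        raise ValueError("shuffled counts have zero variance")
    return (observed - mean) / sd

# Hypothetical numbers for illustration only: 5 shuffles instead of 1000.
shuffled = [3, 5, 4, 6, 2]
print(enrichment_z(9, shuffled))
```

With 1000 shuffles the choice between sample sd (statistics.stdev) and population sd (statistics.pstdev) makes almost no difference.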

Hi Diana,

You can get the z-scores for all 18 sequences by the method described here: http://wise.cgu.edu/sdtmod/measures2.asp . Please make sure you use the p-values and not the ln(p-values). I hope this is what you are looking for. You can repeat the same analysis for all the randomizations.

Thank you Gjain for your answer. My output file looks like this:

# Sequence name | Motif ID | Seq co-ord | Motif Score | -ln(p-value)
seq1 10002 000002 8.26 10.25
seq1 10002 000013 9.38 11.65
seq1 10003 000594 8.06 10.17
seq1 10003 000082 7.13 7.92

and I have about 18 sequences. This output is received after 1000 shuffles. How can I calculate the mean and sd from this, given that I don't have the individual values that make up the raw score? I'm sort of lost.
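
One possible way to get numbers you can take a mean and sd of is to count matches per motif ID in each output file. The Python sketch below (illustration only, easy to port to Perl or R) does that for the rows quoted above, assuming whitespace-separated columns in the order given by the header line. The function name count_matches is hypothetical. To get one count per shuffle, I am assuming each shuffle's output can be kept in its own file and passed through the same function.

```python
from collections import Counter

# The four rows quoted in the post:
# sequence name, motif ID, coordinate, motif score, -ln(p-value)
ROWS = """\
seq1 10002 000002 8.26 10.25
seq1 10002 000013 9.38 11.65
seq1 10003 000594 8.06 10.17
seq1 10003 000082 7.13 7.92
"""

def count_matches(text):
    """Count matches per motif ID, skipping blank and '#' comment lines."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _seq, motif_id, _coord, _score, _neg_ln_p = line.split()
        counts[motif_id] += 1
    return counts

print(count_matches(ROWS))
```

Running this on each of the 1000 shuffled outputs would give 1000 counts per motif, and their mean and sd are what the z-score formula needs.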

11.2 years ago

You asked for a function, but perhaps it is also useful to have a reference. We use the paper "MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data" by Doniger, Conklin, et al., Genome Biology 2003, 4:R7.