I have a set of sequences in which I am identifying over-represented transcription factor binding sites. I've used Patser to identify all possible matrices using Transfac and Jaspar libraries. I'm trying to calculate z-scores for my matrices. Are there any functions to do that in Perl or R?
# PFM from JASPAR = input file
A 16 352 3 354 268 360
C 46 0 10 0 0 3
G 18 2 2 5 0 20
T 309 35 374 30 121 6
# INPUT kmers
TTGGGG
TATATA
TATAAA
TAAATA
# To convert PFM to PWM
w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
where
w - is a weight for the current nucleotide we are calculating
f - is a number of occurences of the current nucleotide in the current column (e.g., "61" for A in column 1, "46" for C etc)
N - total number of observations, the sum of all nucleotides occurences in a column (61+46+18+31=156 in this example)
p - [prior] [background] frequency of the current nucleotide; this one usually defaults to 0.25 (i.e. one nucleotide out of four)
# PWM we get:
A -0.43 1.11 -0.27 1.10 1.46 1.09
C -0.83 -0.21 -0.36 -0.21 -0.21 -0.23
G -0.42 -0.22 -0.26 -0.25 -0.21 -0.35
T 1.54 -0.44 1.09 -0.41 -1.53 -0.25
# To calculate z-score
z = (x - mean)/sd
The variables in the z-score formula are:
z = z-score
x = raw score or observation to be standardized
mean = mean of the population
sd = standard deviation of the population
Also I want to add that I have 2 output files:
1st file contains the matches for my sequences
2nd file contains matches for shuffled sequences(1000 times shuffled)
The z-score that I have to calculate should be
z-score = (no. of matches in unshuffled sequences - Avg. no of matches in shuffled sequences) /standard deviation in shuffled
I dont understand how to calculate standard deviation for my shuffled sequences. Can you help me?
Thanks
Hi Diana,
You can get the z-scores for all the 18 sequences by the method mentioned here: http://wise.cgu.edu/sdtmod/measures2.asp . Please make sure you use the p-values and not the ln(pvalues). I hope this is what you are looking for. You can repeat the same analysis for all the randomization.
seq1 10002 000002 8.26 10.25
seq1 10002 000013 9.38 11.65
seq1 10003 000594 8.06 10.17
seq1 10003 000082 7.13 7.92
and I have about 18 sequences. This output is received after 1000 shuffles. How can I calculate the mean and sd for this because I dont have individual values that make up the raw score. I'm sort of lost.
You asked for a function, but perhaps it is also useful to have a reference. We use the paper MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data by Doniger, Conklin, et al (2003) Genome Biology 2003, 4:R7.
z-score implies normal distribution. Check out "deseq" for this task instead.