Calculating Z-Scores
2
4
Entering edit mode
12.4 years ago
Diana ▴ 910

Hello everyone,

I have a set of sequences in which I am identifying over-represented transcription factor binding sites. I've used Patser to identify all possible matrices using Transfac and Jaspar libraries. I'm trying to calculate z-scores for my matrices. Are there any functions to do that in Perl or R?

Thanks!!

transcription binding enrichment • 6.9k views
ADD COMMENT
0
Entering edit mode

z-score implies normal distribution. Check out "deseq" for this task instead.

ADD REPLY
7
Entering edit mode
12.4 years ago
Gjain 5.8k

Hi Diana,

you can follow this procedure:

# PFM from JASPAR = input file
A   16  352 3   354 268 360
C   46  0   10  0   0   3
G   18  2   2   5   0   20
T   309 35  374 30  121 6

# INPUT kmers
TTGGGG
TATATA
TATAAA
TAAATA

# To convert PFM to PWM
w = log2 ( ( f + sqrt(N) * p ) / ( N + sqrt(N) ) / p )
where
    w - is a weight for the current nucleotide we are calculating
    f - is a number of occurences of the current nucleotide in the current column (e.g., "61" for A in column 1, "46" for C etc)
    N - total number of observations, the sum of all nucleotides occurences in a column (61+46+18+31=156 in this example)
    p - [prior] [background] frequency of the current nucleotide; this one usually defaults to 0.25 (i.e. one nucleotide out of four)

# PWM we get:
A   -0.43   1.11    -0.27   1.10    1.46    1.09    
C   -0.83   -0.21   -0.36   -0.21   -0.21   -0.23   
G   -0.42   -0.22   -0.26   -0.25   -0.21   -0.35   
T   1.54    -0.44   1.09    -0.41   -1.53   -0.25

# To calculate z-score
z = (x - mean)/sd
    The variables in the z-score formula are:
    z = z-score
    x = raw score or observation to be standardized
    mean = mean of the population
    sd = standard deviation of the population

For example:

kmer:TATAAA
raw score: 1.54+1.11+1.09+1.1+1.46+1.09 = 7.39
$zscore = ($raw_score - $mean)/$std_dev;

you can calculate $mean and $std_dev and get the zscores.

hope this helps.

ADD COMMENT
1
Entering edit mode

Also I want to add that I have 2 output files: 1st file contains the matches for my sequences 2nd file contains matches for shuffled sequences(1000 times shuffled) The z-score that I have to calculate should be z-score = (no. of matches in unshuffled sequences - Avg. no of matches in shuffled sequences) /standard deviation in shuffled I dont understand how to calculate standard deviation for my shuffled sequences. Can you help me? Thanks

ADD REPLY
1
Entering edit mode

Hi Diana, You can get the z-scores for all the 18 sequences by the method mentioned here: http://wise.cgu.edu/sdtmod/measures2.asp . Please make sure you use the p-values and not the ln(pvalues). I hope this is what you are looking for. You can repeat the same analysis for all the randomization.

ADD REPLY
0
Entering edit mode

Thank you Gjain for your answer. My output file looks like this:

Sequence name| Motif ID | Seq co-ord | Motif Score | -ln(p-value)

seq1 10002 000002 8.26 10.25 seq1 10002 000013 9.38 11.65 seq1 10003 000594 8.06 10.17 seq1 10003 000082 7.13 7.92 and I have about 18 sequences. This output is received after 1000 shuffles. How can I calculate the mean and sd for this because I dont have individual values that make up the raw score. I'm sort of lost.

ADD REPLY
1
Entering edit mode
12.1 years ago

You asked for a function, but perhaps it is also useful to have a reference. We use the paper MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data by Doniger, Conklin, et al (2003) Genome Biology 2003, 4:R7.

ADD COMMENT

Login before adding your answer.

Traffic: 1814 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6