Question: how to get p value for a set of fdr value
1
Prakash2.0k wrote:

Hi all,

I am comparing two BED files to find the correlation between them. To examine whether the correlation occurs by chance or is meaningful, I have randomized one BED file by shifting it 1 kb upstream and downstream, and I compute the correlation against that randomized file as well. Now I want to find a P value for this.

A and B are the two BED files. A corr B is 0.9 and A corr C is 0.2. How would the P value be calculated here?

C - B randomized by 1 kb

I would really appreciate any help.

open • 2.2k views
modified 5.2 years ago by Alternative240 • written 5.2 years ago by Prakash2.0k
1

The topic is misleading, a FDR (false discovery rate) is calculated for a particular p-Value, e.g., using permutation test. Probably, you meant something else?

How did you calculate the correlation between the two BED files? What are you actually comparing, i.e., what kind of entities are in the BED files? Exons? Genes? SNPs?

How exactly did you do the randomization? For each entry in the BED files you randomly selected a value from the interval (-1000; 1000) and added that to the start and end?

Actually I am comparing two ChIP-seq peak files, so I am just comparing coordinates by checking overlap. If all coordinates of A overlap with B, we can say the correlation is 1; the comparison has been done that way. To assess whether the correlation is significant, I randomized the coordinates by shifting them, e.g., 1 kb upstream or downstream (C). My question is how I can say that the correlation between A and B is significant compared to the one between A and C, and I think a P value is needed for that. I hope this makes it clearer.

I'm a bit confused... Do you want to get a pvalue assessing the significance of a correlation?

3
Carlo Yague5.2k wrote:

If I understand correctly, you want to get a pvalue assessing the significance of a correlation.

Using random permutation is a good idea. However you should do the randomization many times in order to have an empirical distribution of the (A,C) correlation. From that distribution, you can then get the significance of your (A,B) correlation.
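A minimal sketch of what "doing the randomization many times" could look like in base R. Everything here is an assumption made for illustration: the toy peaks live on a single chromosome, the "correlation" is taken to be the fraction of A peaks overlapping at least one B peak, and `overlap_fraction` and `shift_peaks` are hypothetical helper names, not functions from any package.

```r
# Hypothetical toy data: peaks on one chromosome as start/end vectors.
set.seed(1)
a_start <- sort(sample(2000:1e6, 200))
a_end   <- a_start + 300
b_start <- a_start + sample(-100:100, 200, replace = TRUE)  # B sits close to A
b_end   <- b_start + 300

# Fraction of A intervals that overlap at least one B interval.
overlap_fraction <- function(a_s, a_e, b_s, b_e) {
  hits <- vapply(seq_along(a_s),
                 function(i) any(a_s[i] <= b_e & b_s <= a_e[i]),
                 logical(1))
  mean(hits)
}

# One randomization: shift every B peak 1 kb upstream or downstream.
shift_peaks <- function(s, e, dist = 1000) {
  d <- sample(c(-dist, dist), length(s), replace = TRUE)
  list(start = s + d, end = e + d)
}

# Observed (A,B) score and the empirical distribution of (A,C) scores.
obs <- overlap_fraction(a_start, a_end, b_start, b_end)
n_perm <- 1000
rand_corr <- replicate(n_perm, {
  cc <- shift_peaks(b_start, b_end)
  overlap_fraction(a_start, a_end, cc$start, cc$end)
})

summary(rand_corr)
```

The vector `rand_corr` then plays the role of the empirical (A,C) distribution against which `obs` is judged.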

Good luck !

Edit : The statistical test to use will depend on your distribution. If it is normal (you can use a normality test to make sure of that), then you could compute the mean, standard deviation and pvalue like this (with R) :

```
# rand_corr = vector of random (A,C) correlations
# here, for testing: 10000 random numbers with mean 0.2 and sd 0.2
rand_corr <- rnorm(10000, 0.2, 0.2)

# p-value calculation:

mu <- mean(rand_corr)     # mean of random (A,C) correlations
sigma <- sd(rand_corr)    # sd of random correlations
x <- 0.9                  # true (A,B) correlation value
z <- (x - mu) / sigma     # center and scale (z-score)
2 * pnorm(-abs(z))        # two-sided p-value

# visual representation:

hist(rand_corr, breaks = 30)
abline(v = x, col = "red")
```

However, if normality does not hold (it might not, since your correlation values cannot go below 0 or above 1), you might need to resort to other tests.
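One way to check the normality assumption mentioned above is a Shapiro-Wilk test plus a Q-Q plot. This is only a sketch: `rand_corr` is filled with placeholder values, and note that `shapiro.test` accepts between 3 and 5000 values, so a larger vector would have to be subsampled.

```r
# Placeholder for the vector of random (A,C) correlations.
set.seed(7)
rand_corr <- rnorm(1000, 0.2, 0.2)

# Shapiro-Wilk test: a small p-value suggests a departure from normality.
sw <- shapiro.test(rand_corr)
sw$p.value

# Visual check: points should fall close to the reference line.
qqnorm(rand_corr)
qqline(rand_corr, col = "red")
```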

Hi Carlo

Yes, this is exactly what I meant. I want a p value for assessing the significance of the correlation.

As you said, I have randomized the BED files several times. But my question is how the p value (significance) is calculated from that empirical distribution. I am not that good at statistics, so I would be very happy if you could elaborate.

Just count how often you received a correlation coefficient larger than the one you want to test (0.9) and divide by the number of permutations (but better do something like 1000 or 10000 permutations).
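The counting approach described above could be sketched in R as follows. The null values are placeholders, and the "+1" variant is a commonly used correction (an addition here, not something from this thread) that avoids reporting a p-value of exactly 0.

```r
# Placeholder null distribution, clipped to [0, 1]; use your own values.
set.seed(42)
rand_corr <- pmin(pmax(rnorm(10000, 0.2, 0.2), 0), 1)
x <- 0.9  # observed (A,B) correlation

# Number of permutations at least as extreme as the observed value.
n_extreme <- sum(rand_corr >= x)

# Raw fraction, and the +1 corrected version (never exactly 0).
p_raw <- n_extreme / length(rand_corr)
p_adj <- (n_extreme + 1) / (length(rand_corr) + 1)

c(raw = p_raw, corrected = p_adj)
```

With the correction, the smallest reportable value is 1/(number of permutations + 1), which makes the resolution limit of a small permutation count explicit.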

I edited my answer to elaborate as you suggested :)

@Manuel Landesfeind I see why you would do that but I don't think you could call that a "pvalue".

1

Hmm... the fraction `#(corr >= 0.9)/ #permutations` should converge (with a sufficient number of permutations) toward the probability of observing a correlation of 0.9 or higher from the given sample values just by chance... how does this differ from a p-Value?

Given an infinite number of permutations, and given that the correlation coefficient truly follows a Gaussian distribution, we should get the same resulting value, I guess. [EDIT] Probably not exactly, because from the "2*pnorm(...)" in your R code I think you get a p-Value for observing a correlation more extreme in either direction (i.e., x <= -0.9 or x >= 0.9), right? [/EDIT]

In fact, people use statistical distributions to circumvent a computationally expensive permutation test. For example, this allows a direct calculation of a p-Value from the correlation coefficient (see http://vassarstats.net/rsig.html). But, if you already calculated correlation coefficients (or any other value) for a sufficient number of permutations, you can directly get your p-Value from the calculated values. [EDIT2] However, I think that permutation tests are far more robust than p-Values estimated from a distribution. [/EDIT2]
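As an illustration of the "direct calculation" route, the standard t-transformation for a Pearson correlation r computed from n paired observations can be written in a few lines of R. This is a sketch for an ordinary Pearson correlation; whether it applies to an overlap-based score like the one in this thread is a separate question. `r_to_p` is a hypothetical helper name and the data are simulated.

```r
# Two-sided p-value for a Pearson correlation r from n paired observations,
# using t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
r_to_p <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Sanity check against cor.test() on simulated data.
set.seed(3)
u <- rnorm(30)
v <- u + rnorm(30)
ct <- cor.test(u, v)

p_manual <- r_to_p(unname(ct$estimate), length(u))
all.equal(p_manual, ct$p.value)
```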

PS: I really like statistics but I would not call myself an expert. Probably, somebody with more expertise can comment.

Oh, nice point !

Thanks Carlo and Manuel for the discussion!!

I am going to try the method Carlo mentioned, since I already have the correlation coefficient and several permutations; 1000 or 10000 permutations would be really computationally expensive.

I have run 6 permutations. Would that be sufficient, or should I increase the number a little?

I would appreciate any further suggestions.

1

While I agree with Manuel's method when the number of permutations is high, I don't think you should use it with only 6 permutations. Imagine that all 6 permutations give correlations below 0.9: you would then get a pvalue of 0. That 0 is a (relatively) close approximation of your true pvalue, but you obviously can't report it like this.

A similar method that accounts for the uncertainty in the pvalue approximation is the Wilcoxon-Mann-Whitney test. This test doesn't assume normality.

EDIT : code in R, with 6 permutations around 0.2.

```
> wilcox.test(c(0.9),c(0.22,0.3,0.15,0.21,0.4,0.2),alternative="greater")

    Wilcoxon rank sum test

data:  c(0.9) and c(0.22, 0.3, 0.15, 0.21, 0.4, 0.2)
W = 6, p-value = 0.1429
alternative hypothesis: true location shift is greater than 0
```

EDIT2 : If you increase the number of permutations, the pvalue will decrease as your confidence increases.
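To see how the p-value shrinks with the number of permutations, here is a small sketch using placeholder null values that all lie below the observed 0.9. It forces `exact = TRUE`, because for 50 or more values `wilcox.test` otherwise switches to a normal approximation. `p_for_n` is a hypothetical helper name.

```r
x <- 0.9  # observed (A,B) correlation

# One-sided exact Wilcoxon p-value for x against n placeholder null values.
p_for_n <- function(n) {
  null_vals <- runif(n, 0, 0.5)  # placeholder permutation results, all < x
  wilcox.test(x, null_vals, alternative = "greater", exact = TRUE)$p.value
}

set.seed(11)
ns <- c(6, 20, 100, 1000)
p_vals <- vapply(ns, p_for_n, numeric(1))

# When x exceeds every null value, the exact p-value equals 1 / (n + 1).
data.frame(n_perm = ns, p_value = p_vals, bound = 1 / (ns + 1))
```

With 6 permutations the smallest attainable value is 1/7 (the 0.1429 reported above), so reaching p < 0.05 this way needs at least 20 permutations.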

From my point of view, six permutations are far too low for a decent estimate of a p-Value! Did you see that Carlo used 10,000 permutations in his example?

You would probably be better off using bedtools as suggested by Pierre (see below), or checking papers in your research area to get a feeling for how they do it. To be honest, I do not know how good your method for creating the permutations is... but I am also not into ChIP-seq analyses...

1
Alternative240 wrote:

Bedtools can do such statistics using fisher tests. Check http://bedtools.readthedocs.org/en/latest/content/tools/fisher.html