Question

Problem With Fdr Test : Q-Values All < 0.05 For A List Of P-Values That Are Not All Significant.

4

Entering edit mode

11.9 years ago

Francois Olivier Hébert ▴ 280

Hi,

I found 2,302 significant SNPs in genomic sequences representing a few hundred genes sequenced for 2 populations (12 samples / population). Based on the allele counts, I calculated the allele frequencies at each locus. Then, I calculated de allelic frequency divergence (AFD), i.e :

AFD = freq[a1,pop1] - freq[a1,pop2] (allele-freq for pop1 - allele-freq for pop2)

With that, I want to discover which SNPs show a significant difference in allele frequency between both populations in order to identify putative divergent genes (I also performed an outlier test with Bayescan and I also have some problems with that, but it's not the topic of this question!).

I used the Python package "fisher-0.1.4" (http://pypi.python.org/pypi/fisher/) to compute the p-values for each AFD (i.e for each SNP) and I counter-verified these values with R, building a 2x2 matrix for each SNP and performing this test:

fisher.test(mat, alternative="less")

I find that R computes the same p-values than Python. Based on these p-values, among the 2,302 SNPs, 56 SNPs show a AFD > 0.5 and a p-value < 0.05. Now, I want to see if those SNPs are also significant when corrected for multiple hypotheses (FDR). I tried to perform a FDR test to compute q-values based on the distribution of p-values (for the 2,302 SNPs). This is where things get complicated...

I tried two R packages: "qvalue" AND "fdrtool". With both of them, I get an error message, which makes me think that there is something wrong with my dataset of p-values. Here is what I did for the package "qvalue" :

> p <- scan("~/Desktop/pvalues.txt")
Read 2302 items
>qobj <- qvalue(p, fdr.level=0.05, pi0.method="bootstrap")
[1] "ERROR: The estimated pi0 <= 0. Check that you have valid p-values or use another lambda method."

And even when I try to change some parameters in the test, I ALWAYS get this error message.

Then with the package "fdrtool", this is what I did:

> fdr <- fdrtool(p, statistic="pvalue")
Step 1... determine cutoff point
Step 2... estimate parameters of null distribution and eta0
Step 3... compute p-values and estimate empirical PDF/CDF
Step 4... compute q-values and local fdr
Step 5... prepare for plotting

Warning message :
Censored sample for null model estimation has only size 2 !

With that second package (fdrtool), I can try to visualize the q-values computed with the command:

fdr$qval

But ALL the q-values are < 0.05, which is quite strange. How can I have p-values > 0.05 that become significant (i.e < 0.05) when I correct for FDR? And how is it possible that ALL the q-values (n=2,302) are significant? I decided to include in my question a graph showing the distribution of my p-values. Maybe this can help...

http://www.freeimagehosting.net/5v76e

UPDATE 02-06-2012 : max p-value = 0.758 and min p-value = 2.73E-10.

If someone could help me, it would be tremendously appreciated!

Cheers ! FO

statistics • 19k views

ADD COMMENT • link updated 11.9 years ago by Michael 54k • written 11.9 years ago by Francois Olivier Hébert ▴ 280

0

Entering edit mode

can you post the min, max of your p-values here? The error seems to indicate that your p-values are < 0.

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

Entering edit mode

Sure I can do that ! I just updated my question by giving the min and max p-values. I have some p-values that are really small, but they are all above 0.

ADD REPLY • link 11.9 years ago by Francois Olivier Hébert ▴ 280

0

Entering edit mode

Okay. So I got the error part wrong. Would it be possible for you to upload the p-values somewhere and link here? Or if you're willing to try a multiple-correction using R and multtest package, I can provide you the snippet to accomplish this and see how it turns out..

ADD REPLY • link 11.9 years ago by Arun 2.4k

0

Entering edit mode

Problem solved! Thanks for your time. I think my dataset wasn't of super good quality... so I changed a few things and it works now. Since your first answer, I was concerned with the range of my p-values and it turned out that it was the core of the problem. So thanks again!

ADD REPLY • link 11.9 years ago by Francois Olivier Hébert ▴ 280

score 3 · Answer 1 · 2012-06-04

First, I think you are correct with being concerned about the estimated q-values and the presence of the warning. In my understanding, p-values adjusted for multiple testing should always be greater-equal to the unadjusted p-values. The warning also indicates that something in the estimation process went wrong. It might have to do with your p-values not fully covering the whole interval [0,1]. If the method tries to find an estimate for the percentage of true null-hypotheses in the data, due to the lack of p-values close or equal to 1, it might not be able to do so. At least this is my interpretation of the warning message.

I have tried to reproduce the behavior, given your p-value range. It seems to be reproducible with simulated data only.

pvalues <- c(runif(2250, min=2.73E-10, max= 0.758)) # your values
res1 <- fdrtool(pvalues, statistic="pvalue")
Step 1... determine cutoff point
Step 2... estimate parameters of null distribution and eta0
Step 3... compute p-values and estimate empirical PDF/CDF
Step 4... compute q-values and local fdr
Step 5... prepare for plotting

Warning message:
Censored sample for null model estimation has only size 1 ! 
cbind(res1$pval,res1$qval)[res1$pval > res1$qval, ] # have a look at them...
               [,1]         [,2]
[1,] 0.297413740 0.0013290189
[2,] 0.425104731 0.0013472933
[3,] 0.330043362 0.0013349787
[4,] 0.424277501 0.0013472092
[5,] 0.751239836 0.0013910413
....
 sum(res1$pval > res1$qval)
[1] 2249

That yields almost identical significant q-values for all different p-values, ant therefore looks very wrong to me, while:

pvalues <- c(runif(2250, min=0, max=1))
res2 <- fdrtool(pvalues, statistic="pvalue")
sum(res2$pval > res2$qval)
[1] 0

In consequence, the estimation method seem to make assumptions about your data that are maybe not justified. We possibly need to understand the method a bit better, to use other parameters or use a different method. That is all I can say from this applied perspective.

If you are only out for any kind of FDR correction you might as well use p.adjust(p, method="fdr") which will use the method of Benjamini&Hochberg (1995), which has been applied for microarray data for example. I wouldn't comment of the initial test step of your analysis, other than that you could have a look at more complex association tests e.g. in PLINK.