Going For The Right P-Value With Limma
5
6
Entering edit mode
9.9 years ago
Tomas ▴ 30

Hi,

I was wondering if there is a way to decide what kind of p-value will be the right one to identify differentially expressed genes.

I am running a microarray analysis and using the LIMMA package for the identification of those DEG. I am getting a table with the toptable command. I have a table of almost 10000 genes.

How do I know where to set the parameter for a p-valueas a threshold?

I read quite a few papers about it and there is basically no consensus. Most of the people set p<0.01 as one, but by doing so I don't get much of data. IS there a way of analyzing/plotting the p-values to get a better overview of the "right" p-value?

limma p-value differential r microarray • 17k views
5
Entering edit mode
9.9 years ago

Assuming that you are using the defaults and that you are talking about using the adj.p.val, there is a natural and intuitive interpretation. The adjusted p-value is actually a false discovery rate. The natural interpretation is that at 0.01, one out of 100 genes in the list are likely to be false discoveries. At 0.05, five out of 100 in the list are likely false discoveries. What is the best cutoff? It depends. But using the concept of the false discovery rate, you can see that a given cutoff leads to a natural interpretation of the results.

As for your particular experiment, 10000 genes that show differential expression at adj.p.value of 0.01 seems pretty high, but it might make sense for your experiment. Have you looked at a heatmap of your results? Do they make sense?

2
Entering edit mode

You cannot use the raw p-value, unfortunately, as it is not corrected for multiple testing. Besides the comment from neilfws, you might also consider using geneset-based testing like the limma romer and roast functions or a package like globaltest. The idea is to capture biological signal from multiple genes simultaneously where each gene, taken individually, is uninformative. This is a related concept to that in @Davy's answer.

1
Entering edit mode

Like Sean says, the adjusted p-value is the one to use, because you have many more variables (10000) than samples (3 x ? arrays). It's not uncommon to get no significant DE using limma, particularly with a small number of replicates or a large number of probes. I'd suggest giving siggenes a try too - http://bioconductor.org/packages/release/bioc/html/siggenes.html - if that gives no DE except at high FDR values, you'd have to conclude that the arrays are uninformative.

0
Entering edit mode

by 10K I meant the complete table from LIMMA, not DEG. That's exactly the problem though. In the experiment I have three replicates for each array. After running LIMMA I have one adj.p.val of 0.717 and the rest =1. My p-values are also quite high - all in all I have 34 genes with a p-value<0.01. This is why I ask if there is any point in going higher with the p-value like p-val=0.1 (488 genes). Tomas

5
Entering edit mode
9.9 years ago
seidel 9.0k

There is no consensus because there is no "right" p-value. Remember that the p-values you are calculating are based on assumptions. Assumptions that the distribution of the data has a certain shape, that the variability of your replicates has a certain distribution. The p-value represents a fractional area from the shape of a distribution - which may or may not fit your data very well. Experiments of this kind have many variables (that you may be unaware of) such that microarrays can give data distributions that are significantly different from what one might expect. I've seen experiments that return no "good" p-values, that a statistician declares a failed experiment, yet as a biologist I can see a gold mine of information there. In these cases, realize that your treatment is based on assumptions (that you probably haven't or don't know how to test), and simply use the p-values as a ranking mechanism - and forget about absolute magnitude. Be a good experimentalist and look for any signal in the data which makes biological sense (as other have suggested, use gene set analysis, or find any predictable marker gene), so you can decide if there is anything there, and perhaps how you can do a better experiment next time. On the other hand, if there is nothing there, it may be a truly worthless experiment and you'll waste time trying to mine bad data, chasing false leads. P-values are not magic, they are areas under a curve that may or may not be relevant to your data. Follow Sean's and Davy's and Neal's suggestions, explore the shapes, be a good experimentalist, be a skeptical biologist.

0
Entering edit mode

Where is the up-vote one million button? Excellent post. I would add, that if Tomas can provide more details about his experiment we might be better able to advise on what to do next. What it a simple treatment versus control. Does one group have a particular trait you are interested in looking at? How many arrays/samples did you use. P-values and the analyses mentioned in the blog post above are not the be all and end all of microarray data analysis. A bit more context would be help with helping!

1
Entering edit mode
9.9 years ago
Davy ▴ 210

To get an idea of the distribution of your p-values you could use:

tT <- topTable(fit2, adjust.method="fdr", number=nfow(fit2))


Then plot

x <- -log10(tT\$adj.p.val)
plot(x, type="l")
sigline <- c(.05, .01, .005, .001,.0005, .0001)
sigline <- -log10(sigline)
sigcolors <- c("red", "blue", "green", "yellow","pink","purple")
sapply(1:length(sigline), function(x){abline(h=sigline[x], col=sigcolors[x])})


Not getting a large list of genes that are differentially expressed might actually be a good thing depending on what genes they are and the conditions you are comparing. See the following blog for an idea about what you can do with your data afterwards:

http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html

0
Entering edit mode

Why do you take the log10 while the complete analysis is usually on a log2 scale? A second question is - what does it tells me? I ran it and saw the colored lines, but which one do I need to pick? BTW, thanks for the link - it is very good info of great value for me. Tomas

0
Entering edit mode

The log10 above is only for making the plot in log scale and does not affect the analysis of differential expression.

0
Entering edit mode

@Davy: The log10 above is only for making the plot in log scale and does not affect the analysis of differential expression.

0
Entering edit mode

Sorry, should have explained better. The only reason I used the -log10 of the p-values is to plot them with smaller p-values appearing as larger numbers on the plot. It has no bearing on the analysis, either up or downstream of simply being able to see all your p-values in plot. The coloured lines represent the various significance thresholds that I defined in the vector sigline. They give you a rough idea of how many probes are passing each threshold.

0
Entering edit mode

sorry for the late response. I can see the line for the different p-values. But which one is the best p-value fr my analysis? According to what parameter do you choose the right threshold?

0
Entering edit mode

You are not going to get an answer from this group on the question of what the right number is because there is no one-size-fits-all number to use. If you are unclear about how to interpret your results, I suggest you find a local collaborator who can work with you on your data.

0
Entering edit mode
9.8 years ago
tscheitel • 0

we're comparing mouse microarrays of knock-out mice vs. wild-type. I have three replicates for each treatment. (I know this is not much, but this is what we've got. we are interested in prolonged life expansion due to several KO-mutants.

Some of the genes we are interested in shows a "good" behavior. With that I mean for example, that they show differential expression in the way we expected them to have. The problem is, that the significance depends on the threshold I am going to choose. The genes we are looking for have partially a adj.P.Val higher than the threshold of 0.05, but lower than 0.1. So If I choose the 0.05 the genes will be unimportant, but with 0.1 they will be o.k.

The problem is not what the right one, as I already understood there is no right one, but more important is how to explain the decision to take the 0.1 threshold, besides just saying it fits better to the experiment.

Is there a statistical/methodical way of explaining the threshold chosen?

Thanks

PS I uploaded the image of the example Sean provided. cut-off graph

0
Entering edit mode

Assuming that you are using a false discovery rate adjusted p-value, 0.1 threshold means that 10% of the genes in your list are likely false positives. For many folks, that is an acceptable result, but you will need to decide for yourself. Note that choosing a threshold is not arbitrary (since the false discovery rate has a pretty clear interpretation for gene lists), but it is in the context of a given experiment.

0
Entering edit mode

it looks to me that these are not the adj. p-values, but the p-values. At least this is what the y-axis tells me. If I understand what you are looking for, Sean is right, there is no straight answer to that. But you can take as a approximation the line where the curve begins to run parallel to the x-axis. I am not sure why this is to be honest, but this is how I learned it.

0
Entering edit mode

I agree with frymor that the plot looks to be of p-values. Multiple testing corrections such as the FDR must be applied to gene expression data for them to be meaningful. Please try to ignore the p-values and use the adj.p.value only.

0
Entering edit mode
9.7 years ago
alex-bain • 0

Hi Sean, Frymor and Davy,

I am a newbie to microarray data analysis. I've got a data set of microarrays to work on. I was able to run the rma normalization and LIMMA analysis.

Now I have a big table of over 25K genes. If I understood it correctly most of them are not differentially regulated. The question is of course, how to identify the genes which are significant?

I was using the example offered here to plot my data and got the same curve and lines of p-values. Am I wrong to understand that this plot is the same as the histogram of the p-values distribution which is easily done by various examples and shows a something like this: histogram - image was taken from http://www.tcrt.org But, now that I have it, what do I do with it?

I know that the plot shows me the distribution of the p-values over my experiments. The person I am working with told me I need to look for the point, where the curve goes flat-lined, but this is as arbitrary as just picking a p-value by chance. His reason for that choice was - everybody is doing so. This is what I understood after reading some papers -

1. In a normal experiment most of the genes won't be differentially regulated.
2. the adj. p-value is my multiple testing hypothesis correction value.
3. the higher this value is, the more false positive I get in my list of DE genes.

But all these doesn't help me to find the right threshold. I read this paper: Estimating p-values in small microarray experiments. Here they try to explain why permutations is a good idea with small data sets (which I also have - four replicates for three different conditions each). But still not a clue about the 'right' value

> You are not going to get an answer from this group on the question of what the right number is because there is no one-size-fits-all number to use. If you are unclear about how to interpret your results, I suggest you find a local collaborator who can work with you on your data.


I can understand Sean's saying it is difficult to get an answer to such a question, as there is no straight or direct answer for that. Each experiment is a single, unique data set. But I am sure there is some kind of method to define the right p-value by looking at the distribution of the data. AN explanation might be that I am to choose the point where the curve is getting flatter for the reason that at this point I will have the highest number of significantly differentially regulated genes with the smaller number of false positive in this data.

Is this an explanation which can stand?

Thanks for the help?

Alex

0
Entering edit mode

There is no "right answer" here. It has been suggested that you need to use the false-discovery-rate adj.p.value. Doing so has a very natural consequence--namely, the list of genes has a quantifiable false-discovery-rate. If you are looking at an experiment for which you want to be all-inclusive in your list of genes, you might use an adj.p.value of 0.5 (50% of the genes are expected to be false-positives). In another case, you might want to be as strict as possible and use a false-discovery-rate of 0.001 (1 in 1000 genes are false discoveries). Note that there is not a statistical basis for making this decision. The decision is based entirely on the experimental circumstances. If you still have questions, please consult a local bioinformatics expert who can work through your data with you.