I have performed bulk RNAsequencing of tissue (e.g. kidney) from patients with a disease (e.g. diabetes), and I would like to identify genes that correlate with a continuous clinical variable (e.g. Hemoglobin A1c). I can then obtain a Rho (correlation coefficient) and a p value for each gene.
As you can imagine, I am doing this for about 20,000 genes, and will get multiple p values. The knee-jerk reflex is to perform some sort of correction for multiple testing. But I can't help feeling that correlation tests are inherently descriptive, and should be able to stand on their own without requiring multiple testing correction. Maybe not so relevant, but given the fact that we're now dealing with two random variables (A1c and gene expression), each with their own noise, it seems much less likely that a "significant" correlation will appear by chance.
The null hypothesis is also different I think for these two tests - in a test of difference, we're asking whether the observed distributions of data could have arisen by chance given that there is no difference between the two groups. For correlation analysis, we ask whether the strength of the correlation could have arisen by chance alone given that there is no correlation between the two variables. This may appear to be a minor distinction, but I wonder whether it could influence the answer here.
Also, correlation p values for real world biological data involving humans in my experience generally do not sit in the ranges I see for differential gene expression, and performing p value adjustment often leads everything to be "non-significant", even when the Rho's are potentially meaningful (0.5+).
I'd like to do what is methodologically proper, and I'm wondering if anyone can give me a bit of a deeper explanation on whether the maxim "multiple testing requires p value correction" applies for descriptive statistics like P values.
Thanks in advance! Your comments are all very appreciated :)
Thanks rpolicastro for this detailed response. I've considered your answer and thought on the problem some more.
Given the conservative nature of most commonly applied methods of multiple testing correction, could it instead be valid to obtain a p value by permutation testing for such a scenario? I got this idea from the following website (very old thread now), I'm trying to make sure I understood it correctly:
The somewhat conservative nature of p-value correction for false positives is one of the reasons FDR is popular in genomics. FDR doesn't control for the false positive rate (per say) like a more traditional test such as Bonferroni would. Instead it focuses on the false discovery rate, or the reduction of the number of false positives among the positive tests. One of the commenters on that post actually points this out too. It's often a nice balance between being too conservative (and greatly increasing your false negative rate) and less scary/incorrect than applying no correction at all.
As a side note to all of this, it's often good to look at the histogram of uncorrected p-values to get a good idea of the effectiveness of your test, and a first approximation on whether there could plausibly be some effect caused by your alternative hypothesis. See this blog post for a good explanation of this topic.
Thanks again - so where does permutation testing fit in all this, is it an alternative to FDR? Or do you still need to apply FDR testing to p values obtained by permutation?
yes it can be used; its more commonly used in pathway testing than at a gene or individual transcript level.
check out fgsea.
a lot of the points you mention are valid. probably the most important thing is that genes are regulated in a coordinated fashion. So, theres no reason to expect that the test statistics obtained should resemble what would be obtained from 20000 independent tests.
Thanks for making that point, I hadn't considered that!
have you read about WGCNA? Honestly the main deal with whole transcriptome profiling (in my opinion) is the ability to study what happens how in relation to what, and how that changes between states.
to me, that's typically of more biological interest than any single gene test result, which in a vacuum, means very little.
what im trying to say is that the co-expression idea is kind of a big one ...I might start with WGCNA and then trace development of thought forward through time using articles that cite it
Funny you bring it up now because I was reading about it right before I saw this comment! Have you used it or seen it used much before? This technique is now so old (>13 years) and I'm wondering how it has aged.
My general workflow is to look at individual genes that correlate/associate with a phenotype of interest and then perform enrichment analyses, to get a higher-level lens into potential processes at play. I'll explore this alternate process (WGCNA) and see how the results differ. Thanks again.
bro. "so old" but its been cited like 12,000 times or something. cmon now
Hahaha I know, old is not bad - I'm just always hesitant to invest time in learning a technique in the pre-NGS era hahaha. Wow 12,000 citations...
No worries - perhaps here's an idea. if your study has a batch effect, you are likely to see high type I error which might be hard to differentiate from true positives being regulated coordinately.
If you wanted to do something permutation based and make no supposition your data are clean, you could down load many, many result files from the Gene Omnibus website, and look at patterns of covariance and number of significant genes as a function of study size.
you could then permute labels with your own results as well as with the GEO experiments, and re-generate the test statistics (to be clear, I am suggesting running many permutation based tests in parallel across different datasets). if you get vastly different results from what you see in other studies, could be cause for alarm ... If done rigorously, this would be acceptable to almost everyone I imagine, except maybe reviewers who aren't comfortable with stats (depending on target journal).
Regarding batch effect, would it be fair to say that because I'm performing a gene-phenotype correlation analysis using a clinical parameter that is a random continuous variable, and because the batches are random sets (i.e. not grouped according to important phenotype variables), batch effect should not increase Type I error rate?
Not sure if I understand your suggestion correctly - are you suggesting I perform a similar type of analysis on public data, permutating each dataset to generate null distributions? This would give me test statistic distribution for each gene in each dataset. I can then compare the test statistic (Spearman's rho) in my data with those of public datasets, and see what types of p values it generates?
If so, it's an interesting (if labour intensive) idea, that could provide reference values that I could use for future studies with new datasets. The challenge is the test statistic itself, which here is correlation with a clinical value. In the public datasets, I guess I would need some parameter in a dimension compatible with the public expression data to perform a similar type of correlation, and if not present, I would probably need to generate it (ideally with distribution characteristics similar to my own clinical variable of interest)?
yeah sorry i was pretty lit when I wrote that. it would be a lot easier simply to determine whether or not there is a batch effect (easy, lmk if need), and, if not, do the permutation testing only on your own data. the multiple sample approach was to see if there are issues with batch in your data but thats probably not necessary.
doing that experiment would give you an idea about characteristics of covariation / coexpression in many different diseases or tissues or what have you, which is inherently useful because your first question related to mutliple test correction.
the idea behind the bonferonni correction is that all the tests are independent. however, if the data points are correlated, they are non-independent, rendering it excessively conservative for certain uses.
one way around this is to calculate the number of "effective" genes in your sample, which is eessntially. metric for how many independent axes of expression there are.
in general because most people use FDR, you may get questions if you use something else.
finally, 20000 is too many. you will almost certainly not have 20000 testable genes in your dataset. remove all genes that are: very low expression values, absent from your data, fail QC metrics, etc. if you narrow down to genes that are sufficiently expressed to give you a decent read out on the front end, you avoid paying multiple testing penalty on back end for genes you aren't sufficiently powered to look at anyway... if you need more help you can find my twitter or email and hit me up. good luck
in general its a lot less worrisome if batch does not correlate with case control status. so, yes. however, a strong enough batch effect can alter type I and II error unless its controlled for quite well.
Thanks for all those points. Appreciate it. There's no batch effect in this data - there's only one "disease" with a disease severity index that is variable between patients but doesn't very significantly between batches, and that is what I'm trying to correlate to genes. Based on this conversation, I'm going to make sure I have proof of this available in case reviewers ask.
this technically is not true. you could have a test statistic that is best powered to detect an association, yet still get strong deviation from this result, if the predictors are correlated, which they almost always are for most/all types of omics data.
this estimate supposes the statistics are uncorrelated, which is untrue since genes are coordinately regulated...
Yea, that's why I added the condition that it needs to be the correct test for the data. The formula still likely applies to their test though as an approximation, as regardless of the correlation of genes in the dataset they are performing orders of magnitude more comparisons than one would need to get a false positive rate of close to 1.
hi rpolicastro! big fan of your work. i dont think i understand what you mean by "best test for the data". the count regression models implemented in DESeq, for instance, are very good for calculating statistics for individual gene differential expression. therefore, i dont think its fair to say the arent "[good] tests for the data".
my point is rather that, you could even have a closed-form proof that a test-statistic is optimal, but that test-statistic wouldn't become any less optimal if it is used once, twice, or 800,000 times... my comment is rather about the distribution of test statistics that will result due to running these tests (that are individually fine) together, then treating them as if they are independent. thats why i commented, "you could have a test statistic that is best powered to detect an association, yet still get strong deviation ... "
i think fundamentally we probably agree, maybe i was splitting hairs about distribution test statistics under Ho and H1 without adequately clarifying it. VAL
While we may all be splitting hairs, these discussions are helpful for learners like me!
Could correlation between transcriptional features and a phenotype be such a test-statistic?