Question

Dependent p values

13 months ago

Eliza ▴ 30

Hi, I have data with three columns: SNPs, the gene each SNP maps to (based on ANNOVAR), and a p-value for every SNP. What I would like is to aggregate the p-values for every gene. I know that there is dependence between the p-values, and I don't know the dependence structure, so Fisher's method won't work. I read about some methods here: https://arxiv.org/pdf/1212.4966.pdf and came upon this:

6.1 A rule of thumb. First we state a crude rule of thumb for choosing r. Since any method based on the observed values of p1, . . . , pK would affect the validity of the method (see Subsection 6.3), we have to rely on prior or side information for a suitable choice of r. As a rule of thumb, if there is potentially substantial dependence among the p-values, then we should not use Bonferroni, and the harmonic mean might be a safer choice. If we are certain that the dependence is really strong, then the geometric and the arithmetic means might be an even better option. See Subsection 6.4 for a simulation study illustrating this point.

Based on the article, I would be happy if you could share your knowledge: which technique is best for p-value aggregation in this situation?
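For reference, the merging rules named in the quoted passage can be written out in a few lines. This is only a sketch: the function name `merge_pvalues` is made up for illustration, and the multipliers (K for Bonferroni, e ln K for the harmonic mean when K > 2, e for the geometric mean, 2 for the arithmetic mean) are my reading of the linked paper and should be checked against it before use.

```python
import math


def merge_pvalues(pvals, method):
    """Merge p-values without assuming independence (illustrative sketch).

    The multipliers are assumptions taken from my reading of the linked
    paper; verify them there before relying on the result.
    """
    k = len(pvals)
    if k == 0 or any(p <= 0.0 or p > 1.0 for p in pvals):
        raise ValueError("need at least one p-value, each in (0, 1]")

    if method == "bonferroni":
        combined = k * min(pvals)
    elif method == "harmonic":
        if k <= 2:
            raise ValueError("the e*ln(K) multiplier is assumed to need K > 2")
        harmonic_mean = k / sum(1.0 / p for p in pvals)
        combined = math.e * math.log(k) * harmonic_mean
    elif method == "geometric":
        geometric_mean = math.exp(sum(math.log(p) for p in pvals) / k)
        combined = math.e * geometric_mean
    elif method == "arithmetic":
        combined = 2.0 * sum(pvals) / k
    else:
        raise ValueError(f"unknown method: {method}")

    # A merged p-value is capped at 1.
    return min(1.0, combined)


# Example: the SNP-level p-values of one gene.
gene_pvals = [0.01, 0.02, 0.03, 0.04]
for name in ("bonferroni", "harmonic", "geometric", "arithmetic"):
    print(name, merge_pvalues(gene_pvals, name))
```

All four rules stay valid under any dependence, which is why they are conservative; which one loses the least power depends on how strong the dependence really is.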

pvalue snp Gene • 1.1k views

12 months ago by Eliza ▴ 30

score 1 · Answer 1 · 2023-03-07

13 months ago

LChart 3.9k

Hi Eliza-

What matters is the dependency structure of the p-values under the null. There are a number of ways to deal with this, and the primary question is whether you can re-generate that 3-column file given new data, as that would enable you to perform a permutation test, or to apply any of the many gene collapse tests such as SAIGE [1] or SKAT-O [2].

Failing that, you could take an LD-based approach as in sumFREGAT [3] or Overall [4], which would treat the dependency structure as LD-based correlations of Z-scores.
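To make "LD-based correlations of Z-scores" concrete, here is a minimal sketch of one simple statistic of this kind, a sum-of-Z (burden-style) test. It is not the sumFREGAT or Overall API; the function names are hypothetical, and it assumes you have signed Z-scores (a two-sided p-value plus the direction of effect) and an LD correlation matrix R for the SNPs in the gene. Under the null z ~ N(0, R), so the sum of the Z-scores is normal with variance equal to the sum of all entries of R.

```python
import math
from statistics import NormalDist


def signed_z(pvalue, effect_sign):
    """Turn a two-sided p-value (0 < p <= 1) and an effect direction
    (+1 or -1) into a Z-score."""
    return effect_sign * NormalDist().inv_cdf(1.0 - pvalue / 2.0)


def burden_from_z(z_scores, ld):
    """Sum-of-Z gene test using an LD correlation matrix (list of rows).

    Returns (statistic, p_value), where the statistic is chi-square
    with 1 degree of freedom under the null.
    """
    total = sum(z_scores)
    variance = sum(sum(row) for row in ld)  # 1' R 1
    if variance <= 0.0:
        raise ValueError("LD matrix gives a non-positive variance")
    stat = total * total / variance
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Two SNPs in perfect LD carry the evidence of a single SNP ...
print(burden_from_z([2.0, 2.0], [[1.0, 1.0], [1.0, 1.0]]))
# ... while two independent SNPs with the same Z-scores carry more.
print(burden_from_z([2.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]))
```

The point of the example is the role of R: the same Z-scores give a very different gene-level p-value depending on the LD between the SNPs, which is exactly the dependence that methods assuming independence ignore.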

[1] https://www.nature.com/articles/s41588-022-01178-w

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3415556/

[3] https://academic.oup.com/bioinformatics/article/35/19/3701/5376511?login=false#394113760

[4] https://www.nature.com/articles/s41598-022-07465-0

13 months ago by LChart 3.9k

LChart Thank you. Could you explain a little more: "and the primary question is whether you can re-generate that 3-column file given new data, as that would enable you to perform a permutation test"? The data I have is from an experiment and I can't generate a new one. The p-values I got for every SNP (which I want to aggregate by gene) were calculated using the Armitage test.

13 months ago by Eliza ▴ 30

Sure, but can you switch around the labels in your data and re-run the trend test? If so, it means you have access to the underlying genotype and phenotype data, and can either manually run a permutation test (in which case you can take the minimum p-value within each gene and use permutations to estimate the distribution of that minimum under the null), or apply SAIGE/SKAT-O.
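A manual version of that permutation test could look like the sketch below. Assumptions: genotypes are coded 0/1/2 per patient, labels are 0 (mild) and 1 (severe), and the per-SNP trend statistic is computed as N·r², which matches the Armitage trend test with additive scores up to the N versus N-1 convention. Taking the largest statistic in the gene is equivalent to taking the smallest p-value, because every SNP is referred to the same 1-df chi-square. All function names are illustrative.

```python
import random


def trend_stat(genotypes, labels):
    """N * r^2 between genotype dosage (0/1/2) and a 0/1 label."""
    n = len(labels)
    mean_g = sum(genotypes) / n
    mean_y = sum(labels) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, labels))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in labels)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP or only one phenotype class
    return n * sxy * sxy / (sxx * syy)


def gene_max_stat(snp_columns, labels):
    """Largest trend statistic over the SNPs of one gene
    (equivalent to the smallest per-SNP p-value)."""
    return max(trend_stat(col, labels) for col in snp_columns)


def gene_permutation_pvalue(snp_columns, labels, n_perm=1000, seed=0):
    """Empirical gene-level p-value from shuffling the patient labels.

    Genotypes are left untouched, so the correlation (LD) between SNPs
    is preserved in every permuted data set.
    """
    rng = random.Random(seed)
    observed = gene_max_stat(snp_columns, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if gene_max_stat(snp_columns, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy gene with two SNPs and 20 patients (10 mild, 10 severe).
labels = [0] * 10 + [1] * 10
snp_a = [0] * 10 + [2] * 10          # perfectly associated with severity
snp_b = [0, 1] * 10                  # unrelated to severity
print(gene_permutation_pvalue([snp_a, snp_b], labels, n_perm=200))
```

Because only the labels are shuffled, no assumption about the dependence between the SNP-level p-values is needed; the smallest attainable p-value is 1 / (n_perm + 1), so use more permutations if you need to resolve small p-values.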

13 months ago by LChart 3.9k

What do you mean by switching the labels in the data? I had data on patients with mild or severe disease. For each SNP I counted how many patients were heterozygous, homozygous, or did not carry the mutation, and ran the Armitage test on those counts. So what do you mean by switching the labels? Do you find any of the methods in the article helpful? I thought about using the harmonic mean.

13 months ago by Eliza ▴ 30

Because you have the raw genotype data, you can randomize the patient labels (that is, shuffle which patients are marked mild and which severe), which forms the basis of a permutation test (https://en.wikipedia.org/wiki/Permutation_test).

In addition, there are tools already written for performing multi-variant association tests within genes. I have linked two of them.

The article you referenced is not pointing you in the correct direction, I'm afraid.

13 months ago by LChart 3.9k

Following your suggestion, if I want to use the LD-based approach in sumFREGAT: I read that I can obtain the LD information from the 1000 Genomes Project, but it may not be relevant or applicable to my study population. Can I still use it?

13 months ago by Eliza ▴ 30

LChart I'm sorry for asking so many questions, but I'm new to this field. Can you please explain a little more how to use SKAT or SAIGE in my case? I didn't fully understand the online tutorials, or how to apply them to my data, for example which input files I need.

12 months ago by Eliza ▴ 30