Question

gene set enrichment in an exome cohort study

9.9 years ago

Quak ▴ 490

I should start by saying that I don't know much about statistical genetics, though I do understand statistics.

There is a cohort of 400 patients with a disease. A set of genes is hypothesized to be causative, or to carry significantly more variants than usual.

As I understand it, I need a control cohort in addition to the case cohort, and I have to show that this set of genes is enriched in the cases but not in the controls.

The NHLBI cohort contains almost 4000 individuals who are ethnically matched with the disease cohort.

I have seen people on this forum mention using a "burden test" or a "Fisher test". As I understand it, these methods compare the frequency of variants between cohorts.

For example:

variant_name        1KG     NHLBI    Disease
snp1_from_gene1     0.03    0.01     0.1
snp2_from_gene1     0.7     0.02     0.2
snp3_from_gene2     0.3     0.01     0.1

We would then compare the frequency distributions for NHLBI vs. Disease, or 1KG vs. Disease, to show that the two distributions differ at a certain p-value. 1) Is this correct? Can you explain this part in more detail? Is this what a burden test is? (I don't think so.)

2) As a second question: if I have three control sets, say NHLBI, 1KG, and another disease cohort (e.g. autism), these sets don't necessarily agree with each other, and a variant's frequency in NHLBI may differ significantly from its frequency in 1KG. In the example above, one could compare 1KG vs. NHLBI and find that the distributions are significantly different. One obvious reason is the use of different variant-calling methods. So what is the best strategy for such a comparison?
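To make the per-variant comparison concrete, here is a minimal sketch of a one-sided Fisher's exact test on allele counts for snp1 from the table. It assumes the table values are alternate-allele frequencies, that every individual is diploid and fully genotyped, and that the NHLBI cohort has exactly 4000 individuals; the function names are made up for illustration.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_enrichment_p(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """One-sided Fisher's exact test for excess alt alleles in cases."""
    n_case = case_alt + case_ref
    n_alt = case_alt + ctrl_alt
    n_total = n_case + ctrl_alt + ctrl_ref
    log_denom = log_comb(n_total, n_alt)
    p = 0.0
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    for x in range(case_alt, min(n_case, n_alt) + 1):
        p += exp(log_comb(n_case, x)
                 + log_comb(n_total - n_case, n_alt - x)
                 - log_denom)
    return min(p, 1.0)


def allele_counts(freq, n_individuals):
    """Convert an allele frequency to (alt, ref) counts for diploid samples."""
    n_chrom = 2 * n_individuals
    alt = round(freq * n_chrom)
    return alt, n_chrom - alt


# snp1_from_gene1: Disease (400 patients) vs NHLBI (assumed exactly 4000).
case_alt, case_ref = allele_counts(0.1, 400)
ctrl_alt, ctrl_ref = allele_counts(0.01, 4000)
p_snp1 = fisher_enrichment_p(case_alt, case_ref, ctrl_alt, ctrl_ref)
print(f"snp1 one-sided p-value: {p_snp1:.3g}")
```

This tests one variant at a time, so the p-values would still need a multiple-testing correction across variants, and it does nothing about differences in variant calling between cohorts.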

genome snp gene • 2.8k views

updated 2.5 years ago by Ram 43k • written 9.9 years ago by Quak ▴ 490

Ram · Answer 1 · 2014-05-28

9.9 years ago

Katie D'Aco ★ 1.1k

Burden tests generally aggregate variants within a genomic feature (usually a gene) and perform the statistical analysis per gene rather than per SNP. The introduction to the SKAT paper has a nice description of burden tests.
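As a rough illustration of the aggregation idea, here is a minimal sketch of the simplest collapsing form of a burden test: each individual is scored as a carrier or non-carrier per gene, and carrier counts in cases and controls are compared with a one-sided Fisher's exact test. The data layout, sample names, and function names are hypothetical toy examples, and this is not the SKAT method itself. Real analyses usually restrict to rare or functional variants and often use regression with covariates.

```python
from collections import defaultdict
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denom = log_comb(n, col1)
    p = sum(exp(log_comb(row1, x) + log_comb(n - row1, col1 - x) - log_denom)
            for x in range(a, min(row1, col1) + 1))
    return min(p, 1.0)


def burden_by_gene(genotypes, variant_gene, is_case):
    """Collapse variants per gene and test carrier enrichment in cases.

    genotypes:    {sample: {variant: alt allele count}}
    variant_gene: {variant: gene}
    is_case:      {sample: True for case, False for control}
    """
    n_case = sum(is_case.values())
    n_ctrl = len(is_case) - n_case
    carriers = defaultdict(lambda: [0, 0])  # gene -> [case, control] carriers
    for sample, calls in genotypes.items():
        # A sample counts once per gene, however many variants it carries there.
        genes_hit = {variant_gene[v] for v, n_alt in calls.items() if n_alt > 0}
        for gene in genes_hit:
            carriers[gene][0 if is_case[sample] else 1] += 1
    return {gene: fisher_enrichment_p(ca, n_case - ca, co, n_ctrl - co)
            for gene, (ca, co) in carriers.items()}


# Hypothetical toy data: 4 cases (c1..c4) and 4 controls (k1..k4).
variant_gene = {"snp1": "gene1", "snp2": "gene1", "snp3": "gene2"}
is_case = {s: s.startswith("c")
           for s in ["c1", "c2", "c3", "c4", "k1", "k2", "k3", "k4"]}
genotypes = {
    "c1": {"snp1": 1}, "c2": {"snp1": 1, "snp2": 1}, "c3": {"snp2": 2},
    "c4": {"snp3": 1}, "k1": {"snp1": 1}, "k2": {"snp3": 1},
    "k3": {}, "k4": {},
}
results = burden_by_gene(genotypes, variant_gene, is_case)
for gene, p in sorted(results.items()):
    print(gene, round(p, 4))
```

The point of collapsing is that several rare variants in the same gene, each too rare to test alone, contribute to a single per-gene test.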

You might want to open a new thread for your second question, since it is really a different topic.

updated 4.3 years ago by Ram 43k • written 9.9 years ago by Katie D'Aco ★ 1.1k