Gene identification by exome sequencing of cases and controls
2
0
8.8 years ago
peris ▴ 120

Hi,

I have sequenced the exomes of cases and controls and generated the SNP file. For my gene of interest, I can see SNPs in both cases and controls. To confirm whether this gene is responsible for the disease, what sort of statistic do I need to use, considering that I have multiple rare variants in both the case and the control data set for that particular gene?

snp gene • 2.9k views
4
8.8 years ago

A quick note: There does not currently exist, nor will there likely ever exist (unless we're counting Star Trek-like holodeck simulations as a statistical test...in which case I guess perhaps someday), a statistical test that will determine whether a SNP is causative. You can use statistics to demonstrate significant enrichment/association but not causation, which will require some bench work and possibly making a mouse model.

Anyway, the most common test to see if a SNP is significantly associated would be a Fisher's exact test. The R function is fisher.test().
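For anyone who wants to see what such a test does under the hood, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written in plain Python rather than R. The carrier counts are made up for illustration and the function name is my own; in practice you would just call fisher.test() in R.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Hypothetical counts for one SNP: carriers vs. non-carriers in each group.
cases_with, cases_without = 8, 12
controls_with, controls_without = 2, 38

p_value = fisher_exact_two_sided(cases_with, cases_without,
                                 controls_with, controls_without)
print(f"p = {p_value:.4g}")
```

Note that the group sizes do not need to be equal; the table margins take care of that.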

0

Thanks, Devon, for the explanation and the suggestion. I also have one additional question. If a gene has 7 SNPs in both the case and the control data set, but the number of individuals is different, how should I do the association study? I think I can't do it SNP by SNP.

I am sorry for asking such a basic question, but I am very new to human genetics.

1

By "the number of individual is different", do you mean between cases and controls or by SNP even within the cases and controls? Having different numbers of cases and controls is extremely common and not an issue at all (usually you have a LOT more controls than cases, since controls are easy to come by). Depending on the exact nature of the disease and data, you can sometimes test things as a group, as we did here (see supplemental table S4). You could also test SNPs individually, but that often makes more sense when you're looking at complex diseases and I assume you're working on a more Mendelian disorder.

0

Hi Devon,

Thanks for sharing the paper. Yes, I am working on a Mendelian disorder and will try to follow your advice.

3
8.8 years ago
rbagnall ★ 1.8k

I agree with Devon's reply, but I also wonder if you are looking for a gene burden test. That is, are there more rare variants in gene X (possibly damaging or not) in my cases compared to the number of rare variants in gene X in my controls (possibly damaging or not)?

Take haemophilia A, for example: a group of cases would have more rare variants in the Coagulation Factor VIII gene than controls would (it's a monogenic disease and almost all cases are explained by mutations in FVIII). It doesn't tell you which of the variants are pathogenic, but collectively, rare mutations in the FVIII gene are more numerous in cases than in controls.
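To make the burden idea concrete, here is a rough sketch of a one-sided burden test by label permutation, in plain Python. It is only an illustration of the principle and not how any particular tool implements it; the per-individual rare-variant counts, the function name, and the number of permutations are all assumptions of mine.

```python
import random


def burden_permutation_test(case_counts, control_counts, n_perm=10000, seed=0):
    """Empirical p-value for 'cases carry more rare variants in the gene'."""
    rng = random.Random(seed)
    pooled = list(case_counts) + list(control_counts)
    n_cases = len(case_counts)
    observed = sum(case_counts)  # total rare-variant burden seen in cases
    hits = 0
    for _ in range(n_perm):
        # Shuffle case/control labels and recompute the burden in the "cases".
        rng.shuffle(pooled)
        if sum(pooled[:n_cases]) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: number of rare variants in gene X carried by each person.
cases = [1, 0, 2, 1, 0, 1, 1, 0, 1, 0]
controls = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

p_empirical = burden_permutation_test(cases, controls)
print(f"empirical p = {p_empirical:.4g}")
```

Because the test collapses all rare variants in the gene into one number per person, it says nothing about which individual variants matter.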

Since you have a variant file (presumably vcf) of cases, and controls, you should have a look at plink/seq.

They have an awesome tutorial that works through some example data, and you should work towards the burden tests (under association tests).

Having said that, and with a nod to Devon's caution, you need to be sure that the cases and controls are ethnically matched, that the variants have been called within the same exome (exomic?) regions and sequenced using the same method, and that you have sufficient power to detect a significant enrichment, and boldly go where no man has gone before...

0

Your ethnic matching comment is really important. It can't be stressed enough how easy it is to shoot yourself in the foot if you don't match populations as closely as possible.

0

Hi rbagnall,

Thanks for your suggestion. I tried plink/seq on my vcf, but I am not clear on how to interpret the plink/seq output.

Below is an example of the plink/seq association output:

NM_000445    chr8:144990997..144991082    PLEC 5    BURDEN    1    0.2    7/12


1

7/12: 7 non-reference genotypes in the cases/12 non-reference genotypes in the controls.

BURDEN: is the test used

1: is the p-value

0.2: is the I value (it sort of tells you how significant the result could be, and it seems you will not be able to reach a significant value with this gene in this analysis)
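If you end up with many such rows, a small parser can make the fields easier to read. This is only a sketch in plain Python that assumes whitespace-separated rows with the eight columns LOCUS, POS, ALIAS, NVAR, TEST, P, I and DESC; the function name and the CASE_NONREF/CONTROL_NONREF keys are my own.

```python
COLUMNS = ["LOCUS", "POS", "ALIAS", "NVAR", "TEST", "P", "I", "DESC"]


def parse_assoc_row(line):
    """Split one whitespace-separated association row into named fields."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["NVAR"] = int(row["NVAR"])  # number of variants in the gene
    row["P"] = float(row["P"])      # p-value of the test
    row["I"] = float(row["I"])      # I value
    # DESC holds the non-reference genotype counts as cases/controls.
    case_n, control_n = row["DESC"].split("/")
    row["CASE_NONREF"] = int(case_n)
    row["CONTROL_NONREF"] = int(control_n)
    return row


row = parse_assoc_row(
    "NM_000445    chr8:144990997..144991082    PLEC    5    BURDEN    1    0.2    7/12"
)
print(row["ALIAS"], row["TEST"], row["P"], row["CASE_NONREF"], row["CONTROL_NONREF"])
```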

0

Thanks for your nice explanation. If I understood correctly, the I value should be as low as possible, and there should be more non-reference genotypes in cases than in controls if the gene is associated with the disorder.

0

Hi, can someone help me with the interpretation of these results, please:

LOCUS        POS                        ALIAS       NVAR    TEST      P    I      DESC
NM_000016    chr1:76190072..76229221    34,ACADM    64      BURDEN    1    0.2    313/147


I only have 12 cases and 28 controls - and I get (313/147) calculated from 64 variants? I'm not sure whether this number is supposed to be within the range of the number of cases and controls one has...
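For reference, if the DESC column counts non-reference genotypes summed over all variants in the gene (as described for the 7/12 example earlier in this thread), then the ceiling is the number of variants times the number of individuals, not the number of individuals alone. A quick arithmetic sketch under that assumption (the function name is mine):

```python
def max_nonref_genotypes(n_variants, n_individuals):
    """Upper bound if every individual were non-reference at every variant."""
    return n_variants * n_individuals


n_variants = 64
max_cases = max_nonref_genotypes(n_variants, 12)     # 64 * 12
max_controls = max_nonref_genotypes(n_variants, 28)  # 64 * 28

# Check the reported 313/147 against those bounds.
print(313 <= max_cases, 147 <= max_controls)
```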

Also, what is the usual threshold for I and P?

cheers