Question: Rare Variant Association Tests and ExAC Data Set
I have a question, actually a two part question.

I have whole genome data on 120 cases and 20 controls. They are all unrelated individuals. I have a LoF variant that appears 5 times in cases and 0 in controls. This variant is also absent from all population databases (ExAC, gnomAD, 1000GP etc.). In fact, the gene has just 1 LoF variant in ExAC and the pLI is 0.97 indicating LoF intolerant. The functional genetic evidence also indicates that this gene is related to the phenotype that I'm studying.

I need to actually associate this variant/gene with the phenotype but given my small number of controls this is a problem. I was simply going to use Fisher's exact test. My question is:

What test should I use? How might the ExAC or gnomAD data be used in a case/control analysis? For example, I was going to take all the European samples (my samples are mostly of European ancestry) and use all the high-quality calls at this site as part of my control set. I downloaded the gnomAD whole genome VCFs (they don't include the genotype field). These were created using a variant calling, annotation, and filtering pipeline similar to my own. In fact, I followed these practices as much as I could when processing my own data.

Anyway, my idea was simply to use the European samples from ExAC or gnomAD with high-quality calls at the site where my variant is located. So if there are 6000 such calls, I can use this figure in Fisher's exact test along with the 20 controls.
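To make the comparison concrete, here is a minimal sketch of a one-sided Fisher's exact test computed directly from the hypergeometric distribution, using only the Python standard library. It assumes each of the 5 observations is one heterozygous carrier, counts individuals rather than alleles, and treats the 6000 figure as the purely hypothetical number of well-covered European gnomAD samples mentioned above. The function name `fisher_one_sided` is made up for illustration.

```python
from math import comb


def fisher_one_sided(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact p-value for carrier enrichment in cases.

    Returns P(X >= case_carriers), where X is the number of carriers that
    land among the cases when the margins of the 2x2 table are held fixed.
    """
    n_total = case_total + control_total        # all individuals
    k_total = case_carriers + control_carriers  # all carriers
    denom = comb(n_total, case_total)
    upper = min(k_total, case_total)
    tail = sum(
        comb(k_total, x) * comb(n_total - k_total, case_total - x)
        for x in range(case_carriers, upper + 1)
    )
    return tail / denom


# Own data only: 5 carriers in 120 cases, 0 carriers in 20 controls.
p_own = fisher_one_sided(5, 120, 0, 20)

# Own controls pooled with a HYPOTHETICAL 6000 gnomAD samples, all non-carriers.
p_pooled = fisher_one_sided(5, 120, 0, 20 + 6000)

print(f"20 controls only:       p = {p_own:.3f}")
print(f"20 + 6000 (assumed):    p = {p_pooled:.2e}")
```

With 20 controls alone the test has essentially no power (p is roughly 0.46), while pooling the assumed external samples drives the p-value to the order of 1e-9. That gap comes entirely from the external counts, so the result is only as trustworthy as the ancestry matching and coverage at the site.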

Does anyone have anything to suggest about this? I mean, with 20 controls you're pretty much screwed; I don't know what else you can do.

Oh, and I see Daniel MacArthur's name is on the Barrett et al. paper as well. I think he's just warning people to be cautious. In my case I think this step is at least reasonable.

Please use ADD COMMENT or ADD REPLY to respond to previous posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

Unless these 5 carriers are distantly related, it looks like you have a fairly solid case for associating this variant with the disease. The easiest approach would be to get some more controls from your particular population and genotype the position of that variant. This does not need genome sequencing, just targeted resequencing. Something else you may want to do is determine the haplotype this variant sits on and measure how long the shared haplotype is, to get an idea of how closely related your 5 carriers are.

Furthermore, you need to make sure that these 5 carriers are not from an isolated population in which that variant is more frequent. How sure are you about their ethnicity? Note that you can use 1000 genomes data, for example, to perform a cluster analysis to see where those samples belong.

The main conclusion you can draw from the variant not being in ExAC is that the variant is very rare.

Hi, thanks for the response. Clustering of the samples was already done and they are all of European ancestry. Getting more controls is not an option.

That's a pity. No other labs you can collaborate with to get more controls?

This issue came up recently in connection with a high-profile targeted sequencing study in autism by Stessman et al., which was retracted due to statistical flaws. Stessman et al. also did a case/control analysis using the ExAC data, and Barrett et al. 2017 was critical of that decision as well but noted that many of the flaws could be overcome by matching populations and checking that the sites have been well covered. To quote the Barrett et al. 2017 paper:

"ExAC resource does provide summaries broken down by major continental ancestries and also provides coverage-depth information such that the data could be used much more accurately, albeit still imperfectly, in this context."

So yes, it's imperfect, but it can also be reasonably accurate provided you take the proper steps, maybe. I don't know, "much more accurately" sounds like "fairly accurate" to me.

For the case/control analysis, however, Stessman et al. used a permutation test, which seems like the correct test to use in this case. For this test they "permute the labels of the cases and controls", whatever that means. I've never done this myself.
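For what it's worth, "permuting the labels" generally means shuffling the case/control assignments among the individuals many times and asking how often the shuffled data look at least as extreme as the real data. Below is a minimal generic sketch of that idea in standard-library Python. It is not the procedure from Stessman et al., whose exact statistic is not described here. The function name `permutation_pvalue`, the seed, and the number of permutations are illustrative assumptions, and the data are the 120 cases and 20 controls from the question with 5 carriers, all in cases.

```python
import random


def permutation_pvalue(is_case, is_carrier, n_perm=10_000, seed=0):
    """Label-permutation p-value for carrier enrichment in cases.

    Test statistic: number of carriers who are labelled as cases.
    The case/control labels are shuffled while carrier status stays fixed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sum(c and k for c, k in zip(is_case, is_carrier))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted = sum(c and k for c, k in zip(labels, is_carrier))
        if permuted >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# 120 cases followed by 20 controls; the 5 carriers are all cases.
is_case = [True] * 120 + [False] * 20
is_carrier = [True] * 5 + [False] * 115 + [False] * 20

p_perm = permutation_pvalue(is_case, is_carrier)
print(f"permutation p-value: {p_perm:.3f}")
```

For a single variant this converges to the one-sided Fisher's exact p-value, because shuffling labels with fixed carrier counts is exactly the hypergeometric null. So it does not rescue a 20-control design. Permutation becomes more useful for gene-level burden statistics that have no closed-form null distribution, and it needs per-individual data, which the gnomAD site-level VCFs do not provide.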

