Question: Whole exome sequencing data, rare variants and QQ-plots
felejohs wrote, 3.5 years ago:

Hi

I'm having a problem with a whole-exome-sequenced dataset of about 400 human subjects: 200 cases with a certain disease and 200 controls without. The dataset has been through rigorous quality control (standardised QC in plink with HWE, IBD, missingness, sex-check, etc., along with HapMap population stratification and Eigenstrat/PCA analysis). I'm using plink to run a basic association analysis on all variants between cases and controls. While the resulting QQ-plot for the common variants (MAF > 0.01) is OK, the plot for the rare variants (MAF < 0.01) is less so. Below are the three QQ-plots for all, common and rare variants, along with lambda values:

The main problem seems to be the positive deviation (observed > expected) of the rare variants in the first part of the plot, which makes the lambda very large both for the rare-variant QQ-plot and for the plot of all variants. I am wondering what could cause this behaviour for the rare variants, and what implications it has for the prospects of analysing rare and common variants together.
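For readers unfamiliar with the lambda quoted alongside a QQ-plot: it is usually the genomic inflation factor, the median of the observed test statistics divided by the median expected under the null. The sketch below assumes the common median-based definition on a 1-df chi-square scale; the function names are illustrative and not taken from plink.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def p_to_chi2_1df(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    # Use the lower tail so that very small p-values stay numerically accurate.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def genomic_lambda(p_values):
    """Median observed chi-square divided by the median expected under the null."""
    stats = [p_to_chi2_1df(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a well-behaved null distribution.
uniform = [(i - 0.5) / 101 for i in range(1, 102)]
print(genomic_lambda(uniform))
```

A lambda near 1 indicates no inflation. Because the statistic depends only on the median, a large block of variants sharing the same few p-values can move it substantially.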

I would be grateful if anybody has any experience in these matters and could provide some input.

Many thanks.

modified 14 months ago by zx8754 • written 3.5 years ago by felejohs

Lemire (Canada) wrote, 3.5 years ago:

QC likely has nothing to do with your problem. Here are two things you need to think about:

  1. You're using a 2df test (genotypic?) for rare variants even though some of your cell counts are likely to be 0.
  2. You're using a QQ-plot designed to assess the distribution of a continuous variable, when yours can only take a finite number of possible values because of the small cell counts. That makes the QQ-plot uninterpretable.

Don't overthink the QQ-plot in that case.
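To make the second point concrete, here is a rough sketch that enumerates every p-value a rare variant can produce. It assumes a 1-df allelic chi-square test without continuity correction and fixed margins (400 case chromosomes, 400 control chromosomes); these are illustrative assumptions, not necessarily the test that was run.

```python
from math import comb, sqrt
from statistics import NormalDist


def pearson_chi2(a, b, c, d):
    """Pearson chi-square (1 df, uncorrected) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def attainable_p_values(n_case_alleles, n_control_alleles, minor_count):
    """Map each attainable asymptotic p-value to its exact null probability."""
    total = n_case_alleles + n_control_alleles
    lo = max(0, minor_count - n_control_alleles)
    hi = min(minor_count, n_case_alleles)
    out = {}
    for a in range(lo, hi + 1):  # a = minor alleles observed in cases
        c = minor_count - a      # c = minor alleles observed in controls
        chi2 = pearson_chi2(a, n_case_alleles - a, c, n_control_alleles - c)
        p = 2.0 * NormalDist().cdf(-sqrt(chi2))  # upper tail of chi-square, 1 df
        # Hypergeometric probability of this table given the margins.
        prob = comb(n_case_alleles, a) * comb(n_control_alleles, c) / comb(total, minor_count)
        key = round(p, 12)
        out[key] = out.get(key, 0.0) + prob
    return out


# A variant seen 3 times among 200 cases + 200 controls (800 chromosomes).
for p, prob in sorted(attainable_p_values(400, 400, 3).items()):
    print(f"p = {p:.4f} occurs with null probability {prob:.4f}")
```

With three minor alleles in total, only two distinct p-values are attainable, so thousands of such variants pile up on a handful of points instead of tracing the continuous diagonal that a QQ-plot assumes.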

modified 14 months ago by zx8754 • written 3.5 years ago by Lemire

Thank you for your input. Is there another way to verify the quality of the rare variants? Or, in other words, if the QC produces good QQ-plots for the common variants, would you be satisfied and move forward even though your planned analysis relies heavily on rare variants (collapsing rare variants on genes and pathways)?

Reply written 3.5 years ago by felejohs

Vincent Laufer (United States) wrote, 3.5 years ago:

Personally, I would generate p-values using an exact test and see whether the inflation is still there.
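As an illustration of what an exact test computes, here is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table of allele counts. The function name and table layout are assumptions made for this example; in practice you would use the implementation in your association software or statistics package.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are cases/controls and columns are minor/major allele counts.
    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of seeing x minor alleles in row 1 given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    p_obs = prob(a)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * tolerance)
    return min(1.0, p_value)


# Rare variant: 3 minor alleles, all in cases (400 chromosomes per group).
print(fisher_exact_two_sided(3, 397, 0, 400))
```

Because the p-value is built from the exact null distribution of the table, it does not rely on the large-sample approximation that breaks down when cell counts are near zero.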

modified 14 months ago by zx8754 • written 3.5 years ago by Vincent Laufer

Using Fisher's exact test, the inflation disappeared. Thank you everybody, you've been very helpful.

Reply written 3.5 years ago by felejohs

That's great news! Thanks for letting us know. Glad you did not have to battle any of the fine-ancestry problems I had to deal with.

Reply written 3.5 years ago by Vincent Laufer

Vincent Laufer (United States) wrote, 3.5 years ago:

Principal components are good at catching large-scale differences in population structure, but much less good at catching fine-scale differences between populations. Due to certain principles of population genetics and natural selection, these fine-scale differences generally tend to be captured in rare variation more so than in common variation. As a result, your pipeline might have done a lot to control for confounding in common variation, but much less to control for confounds introduced in rare variants.

Please see these papers:

These two papers will provide a good starting point for understanding what controlling for differences in population structure with PCA might miss.

If you have further questions, please let me know.

modified 14 months ago by zx8754 • written 3.5 years ago by Vincent Laufer

Hi

Thank you for your answer. That is very interesting. My understanding is that most QC methods and protocols have been developed for common variants (GWAS), so I've always been curious how they cope with rare variants in exome data. If I'm reading these papers correctly, they used a varying number of principal components to control for these subpopulations in the data.

As I understand it, Eigenstrat was originally developed for GWAS. Would using PCs generated by Eigenstrat as covariates work for my data to help control for these spurious associations?

Reply written 3.5 years ago by felejohs

Hello! If there are in fact fine-scale population structure differences, PCs will not catch them, so confounding could potentially slip in. I'd start with the suggestion below.

Reply written 3.5 years ago by Vincent Laufer