Quality control filter by genotype counts in a GWAS?
1
0
9.1 years ago
User 7754 ▴ 250

Hi,

I am working with a study that has genotyped data and applying the following filters to clean the data, keeping only the SNPs that have:

CALL_RATE >0.95
HWE_P >1e-6
MAF>0.001

However, I am finding some SNPs with an extremely low P-value, and when I look at these SNPs more closely I find that the genotype counts for homozygous major, heterozygous, and homozygous minor are very unbalanced. For example, in a sample of 500 they look like this:

N     N0    N1    N2
500   0     1     499

So I am tempted to filter these cases out as well, for example by requiring that every genotype count exceeds 2:

N0 | N1 | N2 >2

But this way I remove 1/3 of my data. I can't find this in the literature as a usual QC step in GWAS. Is it not usually done? If it is, what is the minimum acceptable number for genotype counts?

Thank you very much for your help!

Fra

filter gwas • 3.2k views
0

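Do you actually have 500 people in your sample? If so, MAF > 0.001 won't be very useful: with 1,000 alleles (assuming no missing genotypes) it only removes SNPs with at most one copy of the minor allele.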
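To make the arithmetic concrete, here is a rough Python sketch of how a MAF cut-off translates into a minimum minor allele count for a given sample size. The helper name `min_minor_allele_count` is made up for illustration, and the calculation assumes a diploid autosomal SNP with no missing genotypes.

```python
from fractions import Fraction


def min_minor_allele_count(n_samples, maf_threshold):
    """Smallest minor allele count whose MAF is strictly above the threshold.

    Assumption: diploid autosomal SNP, no missing genotypes,
    so there are 2 * n_samples alleles in total.
    """
    n_alleles = 2 * n_samples
    # Fraction avoids floating-point surprises such as 0.001 * 1000 != 1.
    threshold = Fraction(str(maf_threshold))
    # Smallest integer strictly greater than threshold * n_alleles.
    return int(threshold * n_alleles) + 1


# Sample sizes other than 500 are arbitrary, chosen only for comparison.
for n in (500, 5000, 50000):
    print(n, min_minor_allele_count(n, 0.001))
```

With 500 samples the MAF > 0.001 filter only requires two copies of the minor allele, which is why genotype counts such as 0/1/499 or 0/2/498 sit right at the edge of the filter.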

0

Yes, I do. I realise that this is a very low threshold, but I am setting it to the same value for all studies in a meta-analysis.

Is that why I am getting 0 for some of the genotype counts? Is the genotype count usually used as a filter in addition to the MAF and other filters I already have?

Thank you for your help

2
9.1 years ago

This is a standard GWAS step, frequently handled with plink's --geno flag (http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml#miss1). [edit: This is incorrect, I misread the original question; refer to the comment about --maf and --hwe instead]

0

Thank you chrchang523. I thought the --geno filter was used for missing genotypes, but is it actually filtering on the three genotype counts? If I applied --geno 0.1 in Plink, would this be equivalent to manually filtering out the SNPs with fewer than 50 individuals in any of the genotype groups (with N=500), so N0 | N1 | N2 >50?

0

Oh, sorry, I misread your question.

A combination of --maf (or with the plink 1.9 development build, you can also use --mac) and --hwe should work for your use case. Very low homozygote counts will be filtered out by --maf, while very low heterozygote counts when the homozygote counts are higher will be filtered out by --hwe.
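As a rough illustration of how these two filters act on genotype counts, here is a self-contained Python sketch. It is not PLINK's code: `hwe_exact_p` is a plain implementation of the standard exact Hardy-Weinberg test (the approach PLINK documents for --hwe), and `passes_qc` is a hypothetical helper, so p-values may differ slightly from what PLINK reports.

```python
def hwe_exact_p(n_hom_a, n_het, n_hom_b):
    """Two-sided exact Hardy-Weinberg p-value from the three genotype counts."""
    n = n_hom_a + n_het + n_hom_b
    if n == 0:
        return 1.0
    rare = 2 * min(n_hom_a, n_hom_b) + n_het  # copies of the rarer allele
    # Start near the expected heterozygote count, with the same parity as `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs = {mid: 1.0}  # relative probability of each feasible het count
    # Walk down to fewer heterozygotes.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h >= 2:
        probs[h - 2] = probs[h] * h * (h - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        h, hom_r, hom_c = h - 2, hom_r + 1, hom_c + 1
    # Walk up to more heterozygotes.
    h, hom_r = mid, (rare - mid) // 2
    hom_c = n - h - hom_r
    while h <= rare - 2:
        probs[h + 2] = probs[h] * 4.0 * hom_r * hom_c / ((h + 2) * (h + 1))
        h, hom_r, hom_c = h + 2, hom_r - 1, hom_c - 1
    observed = probs[n_het]
    total = sum(probs.values())
    # Sum every configuration that is no more likely than the observed one.
    return min(1.0, sum(p for p in probs.values() if p <= observed) / total)


def passes_qc(n0, n1, n2, maf_min, hwe_min):
    """Hypothetical helper: True if the SNP survives both the MAF and HWE filters."""
    n = n0 + n1 + n2
    if n == 0:
        return False
    maf = min(2 * n0 + n1, 2 * n2 + n1) / (2 * n)
    return maf > maf_min and hwe_exact_p(n0, n1, n2) > hwe_min


# Made-up counts: 20/0/480 has no heterozygotes, 0/1/499 has a single minor allele.
for counts in [(20, 0, 480), (0, 1, 499), (1, 16, 483)]:
    print(counts, passes_qc(*counts, maf_min=0.01, hwe_min=1e-4))
```

In this sketch the 20/0/480 SNP fails on HWE and the 0/1/499 SNP fails on MAF, while a SNP with counts 1/16/483 passes both filters even though one genotype group holds a single individual.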

0

Thank you very much for your clarification! To solve this, do you think I just need to make the thresholds I already have in place stricter? For example, instead of HWE_P > 1e-6 and MAF > 0.001, use HWE_P > 1e-4 and MAF > 0.01?

Using stricter thresholds indeed helps a lot! However, there will still be associations where one genotype group has only 1 individual (for example the SNP below, with 1 homozygote major), but maybe this could be considered a real signal?

SNP                  N     N0    N1   N2    MAF         HWE_P   CALL_RATE       PVAL
12:112543881         500   1     16   483   0.01613     0.13    1               1e-07

A related question is whether to apply these same filters to all the studies regardless of sample size. This study is part of a pipeline applied to many studies in preparation for a meta-analysis, so we had decided that all the studies should have the same QC filters applied to them. In contrast to this approach, would you suggest using different filters for studies with a small sample size such as this one? Thank you so much for your help!
