This is a very large question with no simple answer.
Here is what you should do:
1. Google "GWAS quality control"
2. Start reading papers like this one from Stephen Turner: "Quality Control Procedures for GWAS" http://www.ncbi.nlm.nih.gov/pubmed/21234875
3. As you read these papers (there are a couple dozen that will help you) start to take notes on what kinds of things they recommend. For instance, you will want to do QC by variant, by sample (individual person), by batch or plate, and by chip. Take notes on each of those.
Once you have a command of the literature, construct something like this:
I. Initial processing of new data
- Genotype Calling (Illuminus)
- X and Y probe intensity, Structural Variation (Illumina BeadStudio)
- Conversion to bed/bim/fam (Custom, PLINK)
II. Sample QC
- Sex Check (PLINK)
- Missingness Outliers (PLINK)
- Heterozygosity Rate Outliers (PLINK)
  - i. Calculate observed heterozygosity per individual
- Plot Missingness on X axis, Heterozygosity on Y. Decide reasonable thresholds for exclusion
- Relatedness Checks
  - i. Prune out high LD regions (e.g., HLA)
  - ii. Prune down to 50,000 high quality, LD-independent SNPs
  - iii. Check for IBD > 0.185, visualize (PLINK, R (Turner))
  - iv. Mark or exclude
- Ancestry Checks (PLINK, smartPCA, R scripts)
  - i. Extract SNPs not featured in HapMap 3 Rel. 2 four ancestral populations
  - ii. Merge with HapMap data, flipping HapMap strand
  - iii. PCA on merged file
  - iv. Plot PC loadings
  - v. Determine all PCs having significant correlation to ancestry (R)
  - vi. Exclude ancestry outliers (R)
- Per Chip comparisons on a.-d. (Custom)
- Exclude or mark all sample outliers
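To make the heterozygosity, missingness, and relatedness steps concrete, here is a minimal Python sketch. It assumes you have already pulled per-sample values out of PLINK output (for example F_MISS from a .imiss file, O(HOM) and N(NM) from a .het file, PI_HAT from a .genome file). The function names, the 3% missingness cutoff, and the 3 SD heterozygosity rule are my own illustrative assumptions, not fixed recommendations; only the 0.185 relatedness cutoff comes from the outline above. In practice you would pick cutoffs after looking at the plot.

```python
from statistics import mean, stdev


def het_rate(obs_hom, n_nonmissing):
    """Observed heterozygosity: share of non-missing genotypes that are heterozygous."""
    return (n_nonmissing - obs_hom) / n_nonmissing


def flag_sample_outliers(samples, max_missing=0.03, n_sd=3.0):
    """Flag samples with high missingness or an outlying heterozygosity rate.

    samples: dict mapping sample id -> (missing_rate, obs_hom, n_nonmissing).
    max_missing and n_sd are illustrative defaults (assumptions); needs >= 2 samples.
    """
    rates = {sid: het_rate(hom, n) for sid, (_, hom, n) in samples.items()}
    mu = mean(rates.values())
    sd = stdev(rates.values())
    flagged = set()
    for sid, (missing, _, _) in samples.items():
        too_missing = missing > max_missing
        het_outlier = abs(rates[sid] - mu) > n_sd * sd
        if too_missing or het_outlier:
            flagged.add(sid)
    return flagged


def related_pairs(pairs, threshold=0.185):
    """Return the (id1, id2) pairs whose estimated IBD proportion exceeds the threshold.

    pairs: iterable of (id1, id2, pi_hat) tuples.
    """
    return [(a, b) for a, b, pi_hat in pairs if pi_hat > threshold]
```

Whether you then drop one member of each related pair or just mark it depends on the downstream analysis (see the mixed model question further down).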
III. Marker QC
- Excessive Missingness (PLINK)
  - i. Select threshold based on visual inspection of histogram
- HWE (PLINK)
  - i. If a higher threshold is chosen, manually inspect cluster plot
- Differential Missingness Check (PLINK)
  - i. Informative Missingness – CNV
  - ii. Consecutive Missingness in a stretch
- Low MAF (PLINK)
- Internal Sample Reproducibility (Between Chips) (PLINK)
- External Sample Reproducibility (HapMap Concordance) (PLINK)
- Per Chip Call Rate, AF, GF, comparisons on a.-d. (Custom)
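As a rough illustration of the per-marker checks (call rate, MAF, HWE), here is a small Python sketch that works from genotype counts for a single SNP. Note that it uses a plain Pearson chi-square test for HWE, which is a simplification I swapped in to keep the example short; PLINK reports an exact test, which behaves better for rare alleles. The function names and the default cutoffs (95% call rate, 1% MAF, HWE p < 0.0001) are assumptions for illustration only.

```python
import math


def marker_stats(n_aa, n_ab, n_bb, n_missing):
    """Call rate, minor allele frequency, and chi-square HWE p-value for one SNP."""
    n_called = n_aa + n_ab + n_bb
    call_rate = n_called / (n_called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg equilibrium
    expected = (n_called * p * p, 2 * n_called * p * q, n_called * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_hwe = math.erfc(math.sqrt(chi2 / 2.0))
    return {"call_rate": call_rate, "maf": min(p, q), "p_hwe": p_hwe}


def passes_marker_qc(stats, min_call_rate=0.95, min_maf=0.01, min_p_hwe=1e-4):
    """Apply illustrative thresholds (assumptions; choose yours from histograms)."""
    return (
        stats["call_rate"] >= min_call_rate
        and stats["maf"] >= min_maf
        and stats["p_hwe"] >= min_p_hwe
    )
```

In a case/control study you would typically run the HWE check in controls only, and still look at the cluster plots for anything borderline.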
IV. Batch Effects
- Average MAF (PLINK, Custom)
- Average call rates (PLINK, Custom)
- Association Testing by plate (remove MAF <5%) (Custom, PLINK)
- Correction via population stratification techniques if necessary
V. Dataset Merging and Harmonization
- Sample Checks
  - i. Must perform same checks as before on merged set.
  - ii. Results should confirm previous relationships, find new related pairs.
- HWE – after merging, a high number of SNPs will be out of HWE due to differences in ancestry.
  - i. Need to stratify by ethnicity, then look for HWE outliers p < 0.0001.
- Population Stratification
  - i. Use AIMs from Dumitrescu 2010
- Marker Checks
  - i. After filtering markers at a 95% call rate within each single study, run a second check at 99% call rate overall.
- Batch Effects
  - i. Test independence of AF with plate membership, and compare the distribution of chi-square statistics to the null distribution.
- Merging
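The plate-based batch check can be sketched as follows: for each SNP, compare allele counts on one plate against all other plates with a 2x2 chi-square test, then see whether more statistics than expected land in the tail. This is only one simple way to "compare to the null"; a QQ plot is the more usual visual. The function names are hypothetical, and the input allele counts are assumed to have been tabulated already (for example from per-plate frequency output).

```python
def allele_chi2(a_in, b_in, a_out, b_out):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    a_in, b_in:   counts of allele A and allele B on the plate being tested
    a_out, b_out: counts of allele A and allele B on all other plates
    """
    n = a_in + b_in + a_out + b_out
    row_in, row_out = a_in + b_in, a_out + b_out
    col_a, col_b = a_in + a_out, b_in + b_out
    if 0 in (row_in, row_out, col_a, col_b):
        return 0.0  # degenerate table (e.g. monomorphic SNP): nothing to test
    return n * (a_in * b_out - b_in * a_out) ** 2 / (row_in * row_out * col_a * col_b)


def fraction_exceeding(chi2_stats, critical=3.841):
    """Share of statistics above the 5% critical value of chi-square with 1 df.

    Under the null (no plate effect) roughly 5% should exceed it.
    """
    return sum(1 for s in chi2_stats if s > critical) / len(chi2_stats)
```

A plate where this fraction is well above 0.05 across many SNPs deserves a closer look before you merge anything.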
VI. Integrated imputation, phasing, and strand flipping
- Genotype Harmonizer
  - i. Across Study-Side HapMap sample Concordance (GH)
  - ii. Inspect original source file designation (GH)
  - iii. MAF comparisons (GH)
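For anyone unsure what strand flipping actually involves, here is a toy Python sketch of the logic for a single biallelic SNP. It is not how Genotype Harmonizer or PLINK is implemented; the function names and return labels are made up for illustration. The key point is that A/T and C/G SNPs look the same on both strands, so allele labels alone cannot tell you whether they need flipping, and this sketch simply flags them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the alleles as read from the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_ambiguous(alleles):
    """True for A/T and C/G SNPs, whose alleles are complements of each other."""
    a, b = alleles
    return COMPLEMENT[a] == b


def harmonize(study, reference):
    """Classify a study SNP against the reference panel's alleles.

    Returns one of 'ambiguous', 'same', 'flip', 'mismatch' (hypothetical labels).
    """
    if is_ambiguous(study):
        return "ambiguous"  # strand cannot be resolved from allele labels alone
    if set(study) == set(reference):
        return "same"
    if set(flip(study)) == set(reference):
        return "flip"
    return "mismatch"
```

SNPs that come back as 'mismatch' are worth excluding or checking by hand, and the ambiguous ones are where MAF comparisons against the reference help.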
VII. Association Testing
- Post QC PCA
- Decide between Logistic Regression and Mixed Modelling
  - i. Degree of Relatedness
VIII. Evaluation of QC Quality after Association Analysis
- Calculation of Lambda
- Examination of Intensity Plots
- Replicate SNPs of interest on a DIFFERENT Technology
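For the lambda calculation, the usual definition of the genomic inflation factor is the median of the observed 1-df chi-square association statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). A minimal Python sketch, assuming you already have the per-SNP chi-square statistics in a list (the function name is my own):

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Genomic inflation factor (lambda) from 1-df chi-square association statistics."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A lambda close to 1 suggests the test statistics are not systematically inflated; values noticeably above 1 point to residual stratification, relatedness, or batch problems that the earlier QC steps missed.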