I received genome-wide association (GWAS) data from a colleague who's supposedly done all the imputation and quality control according to the consortium's standards. Genotyping was Illumina 660, imputed to HapMap (3.2 million SNPs total).
The data came to me as a matrix of 11,000 samples (rows) and 3.2 million SNPs (columns). There's a single header row containing the SNP IDs, and genotypes are coded as the number of minor alleles (or the allele dosage for imputed SNPs).
Here are a few rows and columns to show what it looks like:
rs1793851  rs9929479  rs929483
2.0        0          1
1.6        0          1
2.0        NA         0
2.0        0          1
1.6        0          0
2.0        1          NA
1.0        0          0
1.9        0          2
I've always used PLINK for GWAS data management, QC, and analysis because it handles large genotype datasets efficiently. However, this kind of data can't be imported directly into PLINK or converted into a pedigree format file. (PLINK does handle imputed data, and so does SNPTEST, but both of these require genotype probabilities, and I only have the expected allele dosage.)
I did write some R code to read in the data in chunks and run some simple summary and association statistics, but this is clunky and suboptimal for many reasons:
- The dataset first has to be split up (I used a Perl wrapper around the UNIX cut command to do this). Once the dataset is split into several hundred files, each containing all my samples and a subset of SNPs, computing sample-level measures (sample call rate, relatedness, ethnic outliers) is going to be a real coding nightmare.
- Subsetting analyses is going to be difficult (not as easy as PLINK's --exclude, --extract, --keep, --remove, --cluster, etc.).
- PLINK integrates SNP annotation info (from the map file) into your results. Joining my QC and analysis results to genomic position, minor allele, etc. will require lots of SQL joins.
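For what it's worth, because samples are rows, the simplest sample-level and SNP-level measures can in principle be collected in a single streaming pass without splitting the file. Below is a minimal Python sketch of that idea, not my actual R code. It assumes whitespace-delimited fields, a single header row of SNP IDs, and "NA" for missing values; the function name `stream_qc` is made up for illustration. It is pure Python, so it would be slow on the full 100 GB file, and it does nothing for relatedness or ancestry outliers.

```python
import io


def stream_qc(handle, missing="NA"):
    """One pass over a samples-by-SNPs dosage matrix (hypothetical helper).

    Returns (sample_call_rates, snp_stats): one call rate per sample row, and
    per SNP its call rate and mean dosage over non-missing samples.
    """
    snp_ids = next(handle).split()          # header row: one ID per SNP
    n_snps = len(snp_ids)
    called = [0] * n_snps                   # non-missing count per SNP
    dosage_sum = [0.0] * n_snps             # running dosage total per SNP
    sample_call_rates = []

    for line in handle:
        fields = line.split()
        if not fields:                      # skip blank lines
            continue
        if len(fields) != n_snps:
            raise ValueError("row has %d fields, expected %d" % (len(fields), n_snps))
        n_called = 0
        for j, value in enumerate(fields):
            if value == missing:
                continue
            called[j] += 1
            dosage_sum[j] += float(value)
            n_called += 1
        sample_call_rates.append(n_called / n_snps)

    n_samples = len(sample_call_rates)
    snp_stats = {}
    for j, snp in enumerate(snp_ids):
        snp_stats[snp] = {
            "call_rate": called[j] / n_samples if n_samples else float("nan"),
            "mean_dosage": dosage_sum[j] / called[j] if called[j] else float("nan"),
        }
    return sample_call_rates, snp_stats


# Demo on the small excerpt shown earlier in this post.
excerpt = """rs1793851 rs9929479 rs929483
2.0 0 1
1.6 0 1
2.0 NA 0
2.0 0 1
1.6 0 0
2.0 1 NA
1.0 0 0
1.9 0 2
"""
sample_rates, stats = stream_qc(io.StringIO(excerpt))
print(sample_rates)
print(stats["rs9929479"])
```

On the real file the same function could be called with an open file handle instead of the `io.StringIO` object. The per-SNP accumulators hold one number per SNP, so memory stays modest even with 3.2 million columns, but this only covers the easy summaries and none of the subsetting or annotation convenience that PLINK gives.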
Ideally I don't want to rewrite software for GWAS data management, QC, and analysis. I've considered (1) analyzing only genotyped SNPs, or (2) rounding the allele dosage to the nearest integer so I can use PLINK, but both of these methods discard useful data.
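To make the loss from option (2) concrete, here is a small Python illustration using the dosages of the first SNP (rs1793851) from the excerpt above. It is only a sketch: the round-half-up rule is my assumption about how the hard calls would be made, and eight samples are obviously not representative of the full dataset.

```python
import math

# Dosages for rs1793851 from the eight sample rows shown earlier.
dosages = [2.0, 1.6, 2.0, 2.0, 1.6, 2.0, 1.0, 1.9]

# Round each dosage to the nearest integer genotype (half rounds up).
hard_calls = [math.floor(d + 0.5) for d in dosages]

mean_dosage = sum(dosages) / len(dosages)
mean_hard_call = sum(hard_calls) / len(hard_calls)
largest_shift = max(abs(d - h) for d, h in zip(dosages, hard_calls))

print(hard_calls)
print(mean_dosage, mean_hard_call, largest_shift)
```

Here the uncertain calls (1.6 and 1.9) all become 2, so the imputation uncertainty disappears and the mean allele count for this SNP shifts upward.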
Does anyone have suggestions on how I should start to QC and analyze this data without reinventing the wheel or rewriting PLINK? Is there other software that can take this kind of data? Keep in mind that my dataset is nearly 100 GB.
Thanks in advance.