Question: How To Analyze Imputed GWAS Data
Asked 6.6 years ago by Stephen (Charlottesville, Virginia):

I received genome-wide association study (GWAS) data from a colleague who has supposedly done all the imputation and quality control according to the consortium's standards. Genotyping was done on the Illumina 660 chip, imputed to HapMap (3.2 million SNPs total).

The data came to me as a matrix of 11,000 samples (rows) and 3.2 million SNPs (columns). There's a header entry for each SNP, and genotypes are coded as the number of minor alleles (or the expected allele dosage for imputed SNPs).
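(For reference, the fractional values are expected minor-allele counts: an imputed dosage is 0·P(0 copies) + 1·P(1 copy) + 2·P(2 copies) under the genotype posterior probabilities. A minimal R sketch, with made-up posteriors:)

```r
# Expected minor-allele dosage from imputed genotype probabilities:
# dose = 0*P(0 copies) + 1*P(1 copy) + 2*P(2 copies).
# The posterior probabilities below are invented for illustration.
probs <- c(p0 = 0.1, p1 = 0.2, p2 = 0.7)
dose <- sum(c(0, 1, 2) * probs)
dose  # 1.6, i.e. the kind of fractional value seen for imputed SNPs
```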

Here are a few rows and columns to show you what it looks like:

rs1793851 rs9929479 rs929483
      2.0         0        1
      1.6         0        1
      2.0         NA       0
      2.0         0        1
      1.6         0        0
      2.0         1        NA
      1.0         0        0
      1.9         0        2

I've always used PLINK for GWAS data management, QC, and analysis because of its efficient handling of GWAS-scale data. However, this kind of data can't be imported directly into PLINK or converted into a pedigree-format file. (PLINK does handle imputed data, and so does SNPTEST, but both require genotype probabilities, and I only have the expected allele dosages.)

I did write some R code to read in the data in chunks and run some simple summary and association statistics, but this is clunky and suboptimal for many reasons:

  1. The dataset first has to be split up (I used a Perl wrapper around UNIX cut to do this). After splitting the dataset into several hundred files, each with all my samples and a subset of SNPs, computing sample-level measures (sample call rate, relatedness, ethnic outliers) is going to be a real coding nightmare.
  2. Subsetting analyses is going to be difficult (not as easy as PLINK's --exclude, --include, --keep, --remove, --cluster, etc.).
  3. PLINK joins SNP annotation info (from the map file) to your results. Joining QC and analysis results to genomic position, minor allele, etc., will require lots of SQL joins.
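For concreteness, the chunked-reading approach described above looks roughly like this; the toy data, file name, and chunk size here are hypothetical. `read.table`'s `colClasses` argument with `"NULL"` entries skips columns, so only one chunk of SNPs is held in memory at a time:

```r
# Build a tiny dosage file for illustration (in reality ~3.2M columns).
dose.file <- tempfile()
write.table(data.frame(rs1 = c(2.0, 1.6, 2.0),
                       rs2 = c(0, 0, 1),
                       rs3 = c(1, 1, 0)),
            dose.file, row.names = FALSE, quote = FALSE)

n.snps <- 3   # total SNP columns in the file
chunk <- 2    # SNPs to read per pass
maf <- numeric(0)
for (start in seq(1, n.snps, by = chunk)) {
    # "NULL" colClasses entries make read.table skip those columns
    cols <- rep("NULL", n.snps)
    cols[start:min(start + chunk - 1, n.snps)] <- "numeric"
    d <- read.table(dose.file, header = TRUE, colClasses = cols)
    # allele frequency from dosages: mean(dose) / 2
    maf <- c(maf, colMeans(d, na.rm = TRUE) / 2)
}
```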

Ideally I don't want to rewrite software for GWAS data management, QC, and analysis. I've considered (1) analyzing only genotyped SNPs, or (2) rounding the allele dosages to the nearest integer so I can use PLINK, but both of these methods discard useful data.

Does anyone have any suggestions on how I should start to QC and analyze this data without re-inventing the wheel or rewriting PLINK? Any other software suggestions that could take this kind of data? Keep in mind, my dataset is nearly 100GB.

Thanks in advance.

Tags: gwas, imputation, plink, snp, R • 28k views
(modified 10 months ago by ShirleyDai; written 6.6 years ago by Stephen)

Have you tried Beagle and PRESTO?

(written 6.6 years ago by Jarretinha)

I looked through all the Beagle utilities and didn't see much that would help.

(written 6.6 years ago by Stephen)
Answer by Avsmith (Reykjavik, Iceland), 6.6 years ago:

For imputed data, the primary QC really should be done on the input genotypes rather than directly on the imputations. Prior to imputation, I remove SNPs with MAF < 0.01 or HWE p < 1e-06, and only use SNPs called in at least 97% of the samples. Also, since Illumina data has relatively few A/T and G/C SNPs, I remove those as well, since it completely eliminates any potential strand issues (most of my runs are meta-analyzed with other studies, so I think this step is worthwhile). In my opinion, these steps are a key part of good imputations.
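As a rough sketch of those thresholds in R (function names are my own, and a chi-square HWE test stands in for the exact test most pipelines use):

```r
# Hedged sketch of the pre-imputation SNP filters described above:
# MAF >= 0.01, HWE p >= 1e-6, call rate >= 97%, given hard genotype
# counts. A 1-df chi-square HWE test is used here for illustration.
hwe.p <- function(nAA, nAa, naa) {
    n <- nAA + nAa + naa
    p <- (2 * nAA + nAa) / (2 * n)
    expected <- c(p^2, 2 * p * (1 - p), (1 - p)^2) * n
    chisq <- sum((c(nAA, nAa, naa) - expected)^2 / expected)
    pchisq(chisq, df = 1, lower.tail = FALSE)
}

keep.snp <- function(nAA, nAa, naa, n.total) {
    n <- nAA + nAa + naa          # samples with a genotype call
    p <- (2 * nAA + nAa) / (2 * n)
    maf <- min(p, 1 - p)
    # short-circuits, so hwe.p is only evaluated for polymorphic SNPs
    maf >= 0.01 && hwe.p(nAA, nAa, naa) >= 1e-6 && n / n.total >= 0.97
}
```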

However, you already have imputations, so I will assume that they are of high quality. The exact answer to your question depends on the software used for imputation. Personally, I use MACH for most of my imputation work, and I use ProbABEL for association analysis. ProbABEL reads the MACH files directly without modification. MACH2QTL and MACH2DAT can be used similarly, but I don't have any experience with those. During analysis, ProbABEL tracks the imputation quality score from the input files, and SNPs with R² < 0.3 are typically filtered out, but that is usually done as part of the meta-analysis.
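(That post-hoc filter just amounts to dropping result rows with a quality score below 0.3; a minimal R sketch with a hypothetical results data frame and a MACH-style `Rsq` column:)

```r
# Hypothetical association results with a MACH-style Rsq column;
# meta-analyses typically drop poorly imputed SNPs at Rsq < 0.3.
results <- data.frame(snp = c("rs1", "rs2", "rs3"),
                      Rsq = c(0.95, 0.12, 0.40))
filtered <- subset(results, Rsq >= 0.3)
nrow(filtered)  # 2 SNPs survive the filter
```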

Some points:

  • ProbABEL reads the entire file into memory, which will be sizable for large chromosomes and sample sizes.
  • Newer versions of ProbABEL support a "filevector" format that gets around this issue. I've just started experimenting with it, and it removes the memory problem I just mentioned. If the documentation is unclear, there is a better explanation here on how to use the files. The "filevector" format needs to be made via the mach2databel function of GenABEL.
  • To get this all to work, I have a custom job script written in Perl to queue the jobs, and I usually run several in parallel via the Parallel::ForkManager module, but you should choose your own poison.
  • ProbABEL can't take compressed files as input, which is not optimal because the uncompressed files are huge. MACH2QTL and MACH2DAT can take compressed files, but I've still not jumped to those.
  • If your imputations came from IMPUTE, SNPTEST can take the input directly, but I have no experience with SNPTEST.
  • The GenABEL package also has a impute2databel function to make filevector format for use with ProbABEL. I've not worked with this.
  • Previous answers cited association software for use with Beagle imputations. I have no experience with that myself.
  • These packages typically allow for fairly standard association models, and if you want "unsupported" models, the workarounds are much slower. I've loaded all my genotypes into netCDF files and work via R with the ncdf library.
  • I don't recommend working with the "rounded" genotypes, as the decimal values appropriately capture uncertainty in the imputations.
  • You don't want to read the unrounded genotypes into R, as R does not support single-precision floats. Everything is stored as 8-byte doubles, and memory quickly becomes an issue. That is the reason for the netCDF workaround.
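(The double-precision point in the last bullet is easy to verify: R stores every numeric vector as 8-byte doubles, so dosages that would fit in 4-byte floats, as the netCDF "float" type stores them on disk, take roughly twice the memory in R. A quick check:)

```r
# R numerics are always 8-byte doubles, so one million dosages take
# about 8 MB in memory, versus ~4 MB if they could be held as
# 4-byte floats. object.size reports the in-memory footprint.
doses <- numeric(1e6)
mb <- as.numeric(object.size(doses)) / 2^20
round(mb, 1)  # about 7.6 MiB for one million values
```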

Hopefully, this gives you some insight.

(modified 6.6 years ago; written 6.6 years ago by Avsmith)

Since I wrote this last night, all the GenABEL & ProbABEL links appear to have changed. Will re-edit after the links are all fixed at the new site.

(written 6.6 years ago by Avsmith)

Thanks Albert. I've written some code to read in SNPs in "chunks" (groups of 10,000 SNPs at a time). However, because all the files are text, and because imputed genotypes are, as you said, read as doubles, this is extremely memory-intensive and SLOW.

I'm wondering now if it would be better to look into netCDF or DatABEL.

(written 6.6 years ago by Stephen)
Answer by Adairama, 6.2 years ago:

Actually, PLINK now supports dosage formats: (a) you need to set format=1, and (b) the dosage column headers need to be in "FID IID" format.

plink --dosage test.trdose format=1 --fam test.fam

Of course, you need to transpose the mldose file (and rename "Al1" to "A1" and "Al2" to "A2"), as you would with the mlprob file. See the PLINK documentation for the other option arguments.

(written 6.2 years ago by Adairama)

Can you extract certain SNPs from the dosage files for further analyses in PLINK?

(written 4.4 years ago by Sheila)
Answer by Avsmith (Reykjavik, Iceland), 6.6 years ago:

As I mentioned in my previous answer, the packages mentioned above are good at standard models but can't always do exactly the analysis you'd like. In that case, I typically work via netCDF-stored versions of the genotypes. I don't work this way unless necessary, as it is MUCH, MUCH slower than those optimized packages, which can typically read all the genotypes in one go.

As you've requested an example of how to do this, I've included some code. (I edited it slightly to remove some local stuff, and I didn't test the edited version. It should still be close to working.)

This code loads imputations from MACH, but hopefully the logic is apparent enough that you can modify as needed.

library(ncdf)

load.dosages.ncdf <- function(chromosome) {
    imputed <- sub("@", as.character(chromosome), "chr@.dose")
    idfile <- "sampleIDs.txt"
    ids <- read.table(idfile, header = TRUE)
    # NOTE: the SNP info-file name was missing in the original post;
    #   "chr@.info" is an assumed MACH-style name
    snpfile <- sub("@", as.character(chromosome), "chr@.info")
    snps <- read.table(snpfile, header = TRUE)
    # Number of SNPs
    NSNPS <- nrow(snps)
    # Number of individuals
    NID <- nrow(ids)
    # Create the dimensions for SNPs and individuals (2D netCDF)
    snpdim <- dim.def.ncdf("snp", "bases", 1:NSNPS)
    sampledim <- dim.def.ncdf("id", "th", 1:NID, unlim = TRUE)
    # Variable for IDs (NOTE: netCDF is not good at holding
    #   text, so only storing integer names)
    varID <- var.def.ncdf("sampleID", "person", dim = sampledim, 
        missval = 0, prec = "integer")
    # Variable for SNPs (NOTE: this will be the position of the
    #   SNP)
    varSNP <- var.def.ncdf("pos", "bp", dim = snpdim, missval = 0, 
        prec = "integer")
    # Create variable for dosages. Float used to save space
    #   in files
    vargenotype <- var.def.ncdf("dose", "dose", dim = list(snpdim, 
        sampledim), missval = -1, prec = "float")
    # Holders for alleles, stored as bytes
    varrefallele <- var.def.ncdf("ref", "base", dim = snpdim, 
        missval = 0, prec = "byte")
    varaltallele <- var.def.ncdf("alt", "base", dim = snpdim, 
        missval = 0, prec = "byte")
    # rsID for SNPs, stored as integers
    varsnpid <- var.def.ncdf("rsid", "base", dim = snpdim, missval = 0, 
        prec = "integer")
    # Store both MAF and EAF for tracking.
    # Would be easy to calculate on the fly, but kept here out of
    #   laziness
    vareaf <- var.def.ncdf("EAF", "freq", dim = snpdim, missval = -1, 
        prec = "float")
    varmaf <- var.def.ncdf("MAF", "freq", dim = snpdim, missval = -1, 
        prec = "float")
    # Track imputation quality statistics
    varqual <- var.def.ncdf("Quality", "qual", dim = snpdim, 
        missval = -1, prec = "float")
    varrsq <- var.def.ncdf("Rsq", "rsq", dim = snpdim, missval = -1, 
        prec = "float")
    # Flag for whether the SNP is imputed or not
    varimp <- var.def.ncdf("usedimp", "imp", dim = snpdim, missval = -1, 
        prec = "byte")
    # Create output file name
    outfname <- sub("@", as.character(chromosome), "chr@.mldose.ncdf")
    # Create output netCDF file
    outfile <- create.ncdf(outfname, list(varID, varSNP, varrefallele, 
        varaltallele, varsnpid, vareaf, varmaf, varqual, varrsq, 
        vargenotype, varimp), verbose = FALSE)
    # Load details on SNPs
    put.var.ncdf(outfile, varSNP, as.integer(snps$POS), start = 1, 
        count = NSNPS)
    put.var.ncdf(outfile, varsnpid, as.integer(sub("^(rs|is)", 
        "", snps$SNP)), start = 1, count = NSNPS)
    put.var.ncdf(outfile, varrefallele, match(snps$Al1, c("A", 
        "C", "G", "T"), nomatch = 0), start = 1, count = NSNPS)
    put.var.ncdf(outfile, varaltallele, match(snps$Al2, c("A", 
        "C", "G", "T", "-"), nomatch = 0), start = 1, count = NSNPS)
    put.var.ncdf(outfile, vareaf, snps$Freq1, start = 1, count = NSNPS)
    put.var.ncdf(outfile, varmaf, snps$MAF, start = 1, count = NSNPS)
    put.var.ncdf(outfile, varqual, snps$Quality, start = 1, count = NSNPS)
    put.var.ncdf(outfile, varrsq, snps$Rsq, start = 1, count = NSNPS)
    # Load details on IDs
    put.var.ncdf(outfile, varID, ids$IID, start = 1, count = NID)
    # File with genotypes
    genotypes <- file(imputed, "r")
    print(paste(NSNPS, "SNPs to process"))
    # Iterate over the file to load dosages
    for (i in 1:NID) {
        # Scan in a single line
        snpi <- scan(genotypes, what = character(0), nlines = 1, 
            quiet = TRUE, skip = 0)
        # Load the dosages (MACH files have one line per individual;
        #   skip the first two entries)
        put.var.ncdf(outfile, vargenotype, vals = as.double(snpi[3:length(snpi)]), 
            start = c(1, i), count = c(NSNPS, 1))
        # Track progress
        if (!(i%%100)) 
            print(paste("individual #", i, " of ", NID, sep = ""))
    }
    close(genotypes)
    close.ncdf(outfile)
}

# Loop over the chromosomes to load
for (i in 1:23) {
    print(paste("Starting: chr", i, sep = ""))
    load.dosages.ncdf(i)
}

Then, once all these files are created, the following code runs a Cox regression. (I realize this is a model supported by ProbABEL, but it gives you a sense of how to work with netCDF files.)


library(survival)
library(ncdf)

FILE <- "data.txt"
pheno <- read.table(FILE, header = TRUE)
# Function to run the Cox regression.
# This is the main thing to change when running
# different models.
coxfitter <- function(geno, phenodf, formula, init) {
    environment(formula) <- environment()
    formula <- update(formula, . ~ geno + .)
    model <- coxph(formula, init = init, data = phenodf)
    beta <- coef(model)["geno"]
    se <- coef(summary(model))["geno", "se(coef)"]
    pz <- coef(summary(model))["geno", "Pr(>|z|)"]
    return(c(beta = beta, se = se, n = model$n, pz = pz))
}
# Loop over the chromosomes.
for (CHR in 1:23) {
    # Open netCDF file
    genonc <- open.ncdf(sub("@", as.character(CHR), "chr@.mldose.ncdf"))
    # Get list of SNPs
    SNPs <- get.var.ncdf(genonc, "rsid")
    # Get list of IDs
    IDs <- get.var.ncdf(genonc, "sampleID")
    # Number of SNPs
    NSNP <- length(SNPs)
    # Number of samples
    NID <- length(IDs)
    # Placeholders for results
    beta <- rep(NA, NSNP)
    se <- rep(NA, NSNP)
    n <- rep(NA, NSNP)
    pz <- rep(NA, NSNP)
    ref <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "ref")]
    alt <- c("A", "C", "G", "T")[get.var.ncdf(genonc, "alt")]
    id <- get.var.ncdf(genonc, "rsid")
    pos <- get.var.ncdf(genonc, "pos")
    # Used to match sample order in the dosages to the phenotype file
    keep <- match(pheno$IID, IDs)
    # Basic model
    m <- coxph(Surv(FOLLOWUP, PHENO) ~ SEX + AGE, data = pheno)
    formula <- formula(m)
    init <- c(0, coef(m))
    # Fit the model for each SNP on the current chromosome
    for (i in 1:NSNP) {
        dose <- get.var.ncdf(genonc, "dose", start = c(i, 1), 
            count = c(1, NID))
        dose <- dose[keep]
        f <- coxfitter(dose, pheno, formula, init)
        beta[i] <- f[1]
        se[i] <- f[2]
        n[i] <- f[3]
        pz[i] <- f[4]
    }
    # Write out the results
    chrom <- rep(CHR, NSNP)
    results <- data.frame(chr = chrom, snp = paste("rs", id, 
        sep = ""), pos, ref, alt, strand = "+", n, beta = signif(beta, 
        5), se = signif(se, 5), p = signif(pz, 5))
    # ("PHENO" used as a literal output prefix; it was an undefined
    #   variable in the original post)
    write.table(results, paste("PHENO", ".coxph.", "chr", CHR, 
        ".results", sep = ""), sep = "\t", quote = FALSE, row.names = FALSE)
    close.ncdf(genonc)
}

Note: I used version 1.8 of the ncdf package in R. The CRAN repository version (1.6.x) had a bug where missing values of byte variables were not returned as NA in R. Because of that, I moved to 1.8. Version 1.6 has been bumped since I found the bug, so I'm not sure whether the bug still exists in 1.6.5. The author of the package recommended I use 1.8. R ncdf 1.8 can be found here.

(modified 6.6 years ago; written 6.6 years ago by Avsmith)

I should add that I am intrigued by the compression options in ncdf4, as these files are HUGE. However, I've never ported over to ncdf4.

(written 6.6 years ago by Avsmith)
Answer by Genotepes (Nantes, France), 6.6 years ago:

It looks like you have input data for the MACH2DAT program for dosage data. However, I guess ProbABEL can handle it as well.

You are supposed to have a file with SNP descriptions and a phenotype file.

However, I don't know of any way of reordering or handling the data, whereas it could be done in IMPUTE.

(written 6.6 years ago by Genotepes)


Powered by Biostar version 2.3.0