Question

Computing R Squared Statistics For Imputation Results - How?

4

Entering edit mode

10.5 years ago

Click downvote ▴ 720

I have two plink files, one containing the actual SNPs and one containing the result of imputing the same SNPs using a much smaller subset.

What would be the simplest way of getting the r^2 (r squared) scores?

plink has a merge mode that calculates a diff file, but I do not know if it outputs any statistics: http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#merge

The R package snpStats can read plink data, but it does not seem to be able to compare two SNP sets as far as I can see: http://www.bioconductor.org/packages/2.12/bioc/html/snpStats.html

There is a program called http://genome.sph.umich.edu/wiki/CalcMatch that seems to be able to calculate these stats, but require the files to be in merlin format which seems totally obscure and like a hassle converting to it in my pipeline. But might be the best option so far. A conversion script is here: http://www.pypedia.com/index.php/bioinformatics_format_convert

imputation • 8.4k views

ADD COMMENT • link updated 9.0 years ago by Biostar 20 • written 10.5 years ago by Click downvote ▴ 720

score 4 · Answer 1 · 2013-10-18

If typed SNPs are not in 0,1,2 (RAW) plink format, then convert them to raw format using --recodeA option. If the imputed SNPs are posterior probabilities(3 values per SNP) then convert them to dosage(one number ranged 0-2), e.g.: if SNPx for sampleY is 0.3, 0.6, 0.1, then to convert it to dosage: 0.3*0+0.6*1+0.1*2=0.8. Then just use cor to get correlation.

EDIT: Added solution Rscript with example files to GitHub.

Ram · Answer 2 · 2013-10-18

http://www.g3journal.org/content/1/6/457.full:

Once every individual in a panel had been masked and imputed, we assessed accuracy at each SNP as the squared Pearson correlation (R2) between the masked genotypes, which take values in {0,1,2}, and the imputed allele dosages (also known as posterior mean genotypes), which take values in [0,2]

So in my case wouldn't use Plink data, but the dosage files and just compute use the regular cor(x, y, method = c("pearson")) or something similar in NumPy