Question: How Can I Compute The R Square Hat Statistic For Imputed Data?
Kantale120 wrote:

Hi,

I have imputed a large dataset with MACH/minimac, which produced 9 TB of data. The next step of my analysis is to compute the r-squared-hat metric in order to assess the quality of the imputation. (I know that minimac includes its own quality metric, but I am interested in r-squared-hat in particular.)

The QuickTest tool can compute this metric with the option --compute-rSqHat, which according to the documentation:

Compute r-squared-hat for each SNP, which is the (estimated) fraction of variance in the unobserved 0/1/2 genotype explained by the individual mean genotypes. (We assume this is the same definition used by Abecasis et al.)

The problem is that in order to run QuickTest I would have to convert 9 TB of data to the QuickTest format. To avoid this hassle, I would prefer to write a script that calculates the metric myself. So, does anyone know how to compute r-squared-hat from dosage data (or from the a-posteriori imputation probabilities)? All I am asking for is a formula that takes the imputation dosages of a single SNP as input and estimates the r-squared-hat metric.

Thanks a lot!

imputation statistics • 5.0k views
modified 7.8 years ago • written 7.8 years ago by Kantale120

See "Imputation and association testing" for the INFO calculation: http://hmg.oxfordjournals.org/content/17/R2/R122.full It seems the MACH quality metric is RSQR_HAT; is that not the same metric? Also, using PLINK with dosage files outputs INFO: http://pngu.mgh.harvard.edu/~purcell/plink/dosage.shtml
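For reference, the IMPUTE-style INFO measure discussed in that paper can be computed directly from posterior genotype probabilities. Below is a minimal sketch; the function name and input layout are my own, and the formula should be checked against the paper before being relied on:

```python
def impute_info(probs):
    """IMPUTE-style INFO score for one SNP.

    probs: list of (p0, p1, p2) tuples, one per individual, giving the
    posterior probabilities of the 0/1/2 genotype counts.
    """
    n = len(probs)
    # Per-individual expected dosage e_i and expected squared dosage f_i.
    e = [p1 + 2.0 * p2 for (_, p1, p2) in probs]
    f = [p1 + 4.0 * p2 for (_, p1, p2) in probs]
    theta = sum(e) / (2.0 * n)        # estimated allele frequency
    if theta in (0.0, 1.0):           # monomorphic SNP: define INFO as 1
        return 1.0
    # INFO = 1 - sum(f_i - e_i^2) / (2 n theta (1 - theta))
    num = sum(fi - ei * ei for fi, ei in zip(f, e))
    return 1.0 - num / (2.0 * n * theta * (1.0 - theta))
```

When every genotype is called with certainty, each f_i equals e_i squared, so the numerator vanishes and INFO is exactly 1.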

The reason I cannot use the RSQR_HAT from MACH is that I have performed sample chunking. So I have the RSQR_HAT value for each of my sample chunks (around 30), but I want to be able to compute it per SNP over all samples combined.

I am not sure what the common practice is in this situation, but I think taking the mean/median of the RSQR_HAT values over the chunks should be OK; if you want to be strict, then maybe the minimum. Again, if you run a PLINK association on the dosages, the INFO would be calculated over all samples.
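An exact alternative to averaging the per-chunk values: if the per-SNP dosage sums and sums of squares can be recomputed from each chunk's dosage files, you can pool those sufficient statistics across chunks and evaluate a MACH-style Rsq-hat once over all samples. A sketch, assuming dosages on the 0-2 scale (function names are hypothetical, not from any of the tools mentioned):

```python
def rsq_hat_from_stats(n, s, ss):
    """MACH-style Rsq-hat for one SNP from pooled statistics.

    n  : total number of individuals
    s  : sum of dosages
    ss : sum of squared dosages
    """
    mean = s / n
    var = ss / n - mean * mean            # empirical variance of dosages
    p = mean / 2.0                        # estimated allele frequency
    denom = 2.0 * p * (1.0 - p)           # variance of a true genotype under HWE
    return var / denom if denom > 0 else 1.0  # monomorphic: define as 1

def pooled_rsq_hat(chunks):
    """chunks: iterable of per-chunk dosage lists for the same SNP."""
    n, s, ss = 0, 0.0, 0.0
    for dosages in chunks:
        n += len(dosages)
        s += sum(dosages)
        ss += sum(d * d for d in dosages)
    return rsq_hat_from_stats(n, s, ss)
```

Unlike the mean of the per-chunk values, pooled_rsq_hat gives exactly the Rsq-hat you would get from the concatenated dosages of all samples.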

Kantale120 wrote:

I was able to figure out what is happening (with the help of the QuickTest author, who kindly responded to me). On page 32 of this document there is a detailed presentation of the r-squared-hat metric. PLINK's INFO metric uses the G2 definition, whereas QuickTest's --compute-rSqHat option uses G3. In principle, as discussed in the document, these metrics are equivalent under HWE but can otherwise give different values. I wrote these formulas in Python for whoever is interested:
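(The script originally posted here was hosted externally and is no longer reachable; the sketch below is a reconstruction based on the description above, not the author's original code. Which variant corresponds to the document's G2 and G3 labels is my assumption.) One variant divides the variance of the mean genotypes by the HWE-expected variance 2p(1-p); the other divides it by the empirical total genotype variance estimated from the posterior probabilities, matching QuickTest's "explained fraction of variance" wording. Under HWE the two denominators agree.

```python
def rsq_hat_dosage(dosages):
    """r-squared-hat from mean (dosage) genotypes: Var(dosage) / (2 p (1-p))."""
    n = len(dosages)
    mean = sum(dosages) / n
    var = sum(d * d for d in dosages) / n - mean * mean
    p = mean / 2.0
    denom = 2.0 * p * (1.0 - p)
    return var / denom if denom > 0 else 1.0

def rsq_hat_posterior(probs):
    """r-squared-hat as explained variance over total genotype variance.

    probs: list of (p0, p1, p2) posterior probability tuples per individual.
    """
    n = len(probs)
    e = [p1 + 2.0 * p2 for (_, p1, p2) in probs]   # E[g_i]
    f = [p1 + 4.0 * p2 for (_, p1, p2) in probs]   # E[g_i^2]
    mean = sum(e) / n
    var_e = sum(ei * ei for ei in e) / n - mean * mean  # variance explained
    total = sum(f) / n - mean * mean                    # total genotype variance
    return var_e / total if total > 0 else 1.0
```

With perfectly certain genotype calls both functions return 1; with uncertain posteriors at a SNP that departs from HWE they can differ, which is the discrepancy described above.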

Dear Kantale, although it is long after your post, I am interested in seeing your Python script for computing the r-squared-hat metric, to assess the quality of imputation done by the Beagle software. The links you gave cannot be reached. It would be a great help if you could kindly provide them again. Thank you in advance.