Question

Is local ancestry inference typically run with array genotypes instead of imputed genotypes?



2.8 years ago

curious ▴ 750

Local ancestry inference is usually done with RFMix. For large cohorts (UK Biobank etc.), I get the strong impression that it is usually run on array genotypes (a few hundred thousand markers) instead of imputed genotypes (millions). Is this the case?

ancestry • 1.3k views

updated 2.8 years ago by LauferVA 4.2k • written 2.8 years ago by curious ▴ 750

score 0 · Answer 1 · 2021-08-11

This is a very difficult question to answer precisely. The theoretical argument is clear (it follows from the information-content literature), but in practice there are many ways to muddy the waters. I'll give the theoretical argument first, then several practical ones, so that the theory is covered and the practical concerns are still addressed.

Theory: Correctly genotyped markers carry information about an individual's ancestry. If the local ancestry algorithm (hereafter, LAA) and the imputation algorithm (hereafter, IA) are equally good at extracting information from the source data, and if both are given the same data to begin with, then there is no reason the IA should outperform the LAA. This is why most authors (that I am aware of) use only the genotyped data: imputed markers bring no information gain, only additional redundant input. However, if either assumption doesn't hold, then using the imputed output could actually be better. To clarify, here is a breakdown of the input data:

Input Data:

1. Your sample genotypes
2. All reference genotypes provided
3. Any additional information provided to an algorithm

To reiterate, in theory it should not matter what you call the algorithm (LAA vs. IA): as long as it is implemented appropriately and receives the same data, one should not give a better estimate than the other. Ideally, the algorithm will generate a "sufficient statistic" for the local ancestry estimate, which you can think of as the estimate you'd get if you correctly extracted all the information from 1-3.
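The theoretical argument can be written down compactly. This is only a sketch in my own notation, none of which comes from a particular paper: A is the true local ancestry at a locus, G the sample genotypes, R the reference genotypes plus any additional input, and \hat{G} the imputed genotypes. It assumes the IA sees nothing beyond (G, R), which is exactly the assumption that can fail in practice.

```latex
% Assumption: the imputed genotypes are computed from the same inputs only,
%   \hat{G} = f(G, R).
% Chain rule for mutual information:
\[
  I(A;\, G, R, \hat{G}) \;=\; I(A;\, G, R) \;+\; I(A;\, \hat{G} \mid G, R)
\]
% \hat{G} is fully determined by (G, R), so the conditional term vanishes:
\[
  I(A;\, \hat{G} \mid G, R) = 0
  \quad\Longrightarrow\quad
  I(A;\, G, R, \hat{G}) = I(A;\, G, R)
\]
% Data processing inequality: the imputed output alone cannot carry more
% information about A than the inputs it was derived from.
\[
  I(A;\, \hat{G}) \;\le\; I(A;\, G, R)
\]
```

In words: adding imputed markers on top of the genotyped markers and the reference panel cannot add information about ancestry, as long as the imputation used nothing you do not also have.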

Praxis: However, a number of issues, which may or may not apply to a particular project, can influence the outcome. I'll try to divide these into a rough "pro" and "con" list:

Potential reasons why including imputed markers could increase the accuracy of local ancestry estimates (LAEs):

Many imputation algorithms run "in the cloud", so to speak. If you do not have, and cannot get, access to all the background/reference genotypes (data type "2" in the list above) used to make the imputation estimates, then the imputed data might yield a better local ancestry estimate than the genotyped data alone.
Some imputation algorithms conduct pre-phasing, etc., using specialized panels and software. Again, if you do not have access to all of that, then it might be possible to generate better LAEs from the imputed, phased genotypes than from your raw genotyping data.

Potential reasons why including imputed markers could decrease the accuracy of LAEs:

If the local ancestry estimation software treats every variant you provide as "ground truth" and does not allow for the possibility that a genotype was imputed incorrectly, then including imputed markers (specifically, those with low quality metrics) will likely decrease the accuracy of your LAEs.
Assuming your LAA is no better and no worse at retaining information than the IA, it may run more slowly than it would on the genotyped data only (because you are inputting far more markers) while gaining no actual information (because of the assumption at the beginning of this sentence).
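The "ground truth" concern can be reduced by dropping poorly imputed markers before they ever reach the LAA. Below is a minimal Python sketch of that idea. The `Variant` record, the `keep_for_lai` function, and the 0.9 cutoff are all hypothetical names and values chosen for illustration; imputation tools report their quality score under different names (an INFO- or R2-style metric), so adapt the field to whatever your pipeline emits.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical per-variant record; field names are illustrative only."""
    variant_id: str
    genotyped: bool      # True if the marker was typed directly on the array
    imp_quality: float   # imputation quality score; ignored for typed markers


def keep_for_lai(variants: Iterable[Variant], min_quality: float = 0.9) -> List[Variant]:
    """Keep every genotyped marker, plus imputed markers at or above the cutoff.

    The default cutoff of 0.9 is an assumption for this sketch, not a
    recommendation from any tool or paper.
    """
    return [v for v in variants if v.genotyped or v.imp_quality >= min_quality]


if __name__ == "__main__":
    example = [
        Variant("rs1", True, 0.0),    # typed on the array, always kept
        Variant("rs2", False, 0.95),  # well imputed, kept
        Variant("rs3", False, 0.40),  # poorly imputed, dropped
    ]
    for v in keep_for_lai(example):
        print(v.variant_id)
```

Genotyped markers bypass the imputation-quality check here on purpose; they still need their own QC (missingness, HWE, and so on).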

Things you should do no matter what you decide: No matter what, data preparation and QC will have the biggest impact on the final results...

1A. You should include only genotyped variants that have high quality metrics (low missingness, no differential missingness between groups, not very far out of Hardy-Weinberg equilibrium (HWE), etc.).
1B. In just the same way, if you do decide to use imputed markers, you should remove all imputed variants with low imputation quality. Personally, I would be pretty stringent about this (only include variants you are quite sure are imputed with very high accuracy).
1C. If possible, run the analysis twice, once on genotyped + imputed markers and once on genotyped markers only, and see how (dis)similar the results are.

If I can be of further help, let me know.
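For the comparison in 1C, one simple summary is the fraction of shared positions where the two runs assign the same ancestry. The sketch below assumes you have already projected both outputs onto a common set of keys, here `(sample, haplotype, window)` tuples; that key layout, the function name `lai_concordance`, and the ancestry labels are hypothetical and do not correspond to any tool's output format.

```python
from typing import Dict, Hashable


def lai_concordance(calls_a: Dict[Hashable, str], calls_b: Dict[Hashable, str]) -> float:
    """Fraction of shared keys on which two local ancestry call sets agree.

    Each dict maps a key such as (sample, haplotype, window) to an ancestry
    label. Keys present in only one run are ignored.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two call sets have no keys in common")
    agree = sum(calls_a[k] == calls_b[k] for k in shared)
    return agree / len(shared)


if __name__ == "__main__":
    genotyped_only = {
        ("s1", 0, 1): "AFR",
        ("s1", 0, 2): "EUR",
        ("s1", 1, 1): "EUR",
        ("s1", 1, 2): "EUR",
    }
    genotyped_plus_imputed = {
        ("s1", 0, 1): "AFR",
        ("s1", 0, 2): "AFR",  # the two runs disagree here
        ("s1", 1, 1): "EUR",
        ("s1", 1, 2): "EUR",
    }
    print(lai_concordance(genotyped_only, genotyped_plus_imputed))
```

A single genome-wide number can hide localized disagreement, so it is worth also looking at where the discordant windows fall (for example, near ancestry switch points or in regions with few genotyped markers).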