Problems Imputing X Chromosome with TOPMed
11 weeks ago
kianalee ▴ 10

I have a large dataset whose autosomes I was able to successfully phase and impute using TOPMed. I have tried doing the same with the X chromosome but keep running into issues.

Before trying to impute with TOPMed, I did per-individual QC and per-marker QC, then ran checkVCF, and corrected any issues identified with checkVCF such as fixing allele flips, removing duplicate sites, etc. – I performed this on the complete dataset including the X chromosome.

However, when I try to impute the X chromosome, it doesn’t get past TOPMed Quality Control. I did find an old post about someone having issues with their X chromosome when using the Michigan Imputation Server.

They specifically got an error that there were heterozygous variants in their males. Thus, it was suggested that they correct for these heterozygous haploid errors using the following PLINK command:

plink --bfile Input_data --set-hh-missing --recode vcf --out Result_filename

I did this with my files, and I can now impute with TOPMed.
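To make the effect of `--set-hh-missing` concrete, here is a minimal Python sketch of the idea: a heterozygous call in a male at a non-PAR chrX site is replaced with a missing genotype, while female calls are left alone. The function name `mask_male_hets`, the `"M"`/`"F"` sex labels, and the dict-based input are illustrative assumptions, not PLINK's implementation, and the sketch assumes every site passed in is already known to be non-PAR.

```python
MISSING = "./."


def is_het(gt):
    """Return True if a VCF GT string such as '0/1' or '1|0' is a diploid heterozygous call."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def mask_male_hets(genotypes, sexes):
    """Set heterozygous calls to missing for samples labelled male.

    genotypes: dict of sample ID -> GT string at one non-PAR chrX site
    sexes:     dict of sample ID -> "M", "F", or None (hypothetical labels)
    Samples with unknown sex are left untouched in this sketch; PLINK's own
    handling of unknown sex may differ and should be checked in its docs.
    """
    masked = {}
    for sample, gt in genotypes.items():
        if sexes.get(sample) == "M" and is_het(gt):
            masked[sample] = MISSING
        else:
            masked[sample] = gt
    return masked


calls = {"m1": "0/1", "m2": "1/1", "f1": "0/1"}
sexes = {"m1": "M", "m2": "M", "f1": "F"}
print(mask_male_hets(calls, sexes))
```

The point of the sketch is that the masking decision depends on the recorded sex of each sample, so the result is only as trustworthy as the sex information in the input files.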

However, I am concerned because when I use PLINK's --sample-counts, I see that all my individuals are assigned an "ambiguous" sex even though I know I have both male and female individuals. Does this mean I am losing heterozygous variant information for my female individuals?
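One quick way to see what sex information PLINK is actually working with is to tally the sex column of the `.fam` file, where the fifth column is coded 1 for male, 2 for female, and 0 for unknown. The sketch below is a rough check under that assumption; the function name `count_fam_sexes` and the example lines are made up for illustration.

```python
from collections import Counter

# PLINK .fam sex codes (column 5); anything else is treated as ambiguous here
SEX_LABELS = {"1": "male", "2": "female"}


def count_fam_sexes(lines):
    """Tally sex codes from an iterable of .fam lines (FID IID PID MID SEX PHENO)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        counts[SEX_LABELS.get(fields[4], "ambiguous")] += 1
    return dict(counts)


# Hypothetical example records, not real data
example = [
    "FAM1 S1 0 0 1 -9",
    "FAM1 S2 0 0 2 -9",
    "FAM2 S3 0 0 0 -9",
]
print(count_fam_sexes(example))
```

To run it on a real fileset, pass an open file handle instead of the example list, for instance `count_fam_sexes(open("Input_data.fam"))`. If every sample comes back as ambiguous, the sex codes are probably missing from the fileset itself. VCF has no sex field, so this would be expected if the PLINK files were built from a VCF without supplying sex separately.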

x topmed chromosome • 268 views
9 weeks ago
nleone ▴ 20

Hi Kianalee,

I am working with X chromosome data and imputing with the TOPMed server.

There are a lot of issues that come up when working with ChrX:

1. Heterozygous haploid errors can occur in male samples if genotypes were called from the array intensity files with males and females together in the same batch.
2. ChrX has three regions: two pseudoautosomal regions (PAR1 and PAR2) that are diploid in both sexes, and the non-pseudoautosomal (non-PAR) span between them, which is haploid in males.
3. Non-PAR genotypes are haploid in males, while females are diploid but subject to X-inactivation, so you might want to stratify by sex early in your data prep and analyze the sexes separately downstream.
4. Calling non-PAR genotypes jointly in male and female samples can produce heterozygous calls in males. A male cannot be heterozygous in the non-PAR region, so these show up as heterozygous haploid (hh) errors.
5. You can set the hh calls to missing, or re-run genotype calling on ChrX in just the males (and likewise in just the females). Regardless, you should separate your male and female samples for the non-PAR region.
6. If you don’t want to go back and call genotypes separately for males and females, at least separate males and females in your genotype data and create PAR and non-PAR datasets. Then set hh calls to missing, run QC for imputation, and impute the male non-PAR and female non-PAR datasets separately.
7. If you do go back and call genotypes separately for males and females, consider manually checking the clustering on ChrX for males. You may need to tweak this to improve call accuracy.
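The PAR / non-PAR and male / female split described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the boundaries are the GRCh38 pseudoautosomal coordinates for chrX (use the GRCh37 values instead if your data are on that build), and the names `chrx_region` and `imputation_bucket` and the `"M"`/`"F"` labels are hypothetical.

```python
# GRCh38 chrX pseudoautosomal regions, 1-based inclusive (assumed build)
PAR1 = (10_001, 2_781_479)
PAR2 = (155_701_383, 156_030_895)


def chrx_region(pos):
    """Classify a chrX position as 'PAR1', 'PAR2', or 'nonPAR'."""
    if PAR1[0] <= pos <= PAR1[1]:
        return "PAR1"
    if PAR2[0] <= pos <= PAR2[1]:
        return "PAR2"
    return "nonPAR"


def imputation_bucket(pos, sex):
    """Name the dataset a genotype would go into under the split above.

    PAR sites are diploid in both sexes, so they stay together; non-PAR
    sites are separated by sex. Unknown sex gets its own bucket so that
    those samples can be resolved before imputation.
    """
    if chrx_region(pos) != "nonPAR":
        return "PAR"
    if sex == "M":
        return "nonPAR_male"
    if sex == "F":
        return "nonPAR_female"
    return "nonPAR_unknown_sex"


for pos, sex in [(1_000_000, "M"), (50_000_000, "M"), (50_000_000, "F"), (155_800_000, "F")]:
    print(pos, sex, chrx_region(pos), imputation_bucket(pos, sex))
```

In practice the same split would be done with your genotype tooling on whole files rather than per genotype; the sketch just makes the bucketing rule explicit.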

This should fix the hh error messages from the TOPMed server. Good luck!

Dominick A. Leone Boston University School of Public Health Epidemiology Department


It will also deal with ambiguity in sex :)


Hi nLeone, you have highlighted a lot of points here regarding the X chromosome. At some point, could you write a detailed blog post about this, with examples, including how one would do QC on sex if the main intention is to analyze autosomes only? Your experience would really help the community, especially beginners. Thanks, I did get a few pointers from here myself.