Question: Why Are There Two Alleles For Chromosome X In Males (In 1000Genomes Vcf Files)?
8.1 years ago
If you look at variants on the X chromosome in the current VCF files from 1000 Genomes, there are often two alleles for male individuals. For example, for HG00096, we can find:

X 60034 . ACC A 256 PASS AVGPOST=0.9664;LDAF=0.0610;THETA=0.0087;ERATE=0.0027;RSQ=0.7797;AC=117;AN=2184;VT=INDEL;AF=0.05;ASN_AF=0.07;AMR_AF=0.05;AFR_AF=0.09;EUR_AF=0.02 GT:DS:GL 0|0:0.150:0,0,0

X 60052 rs186434315 T A 100 PASS AC=752;AN=2184;VT=SNP;AA=.;AVGPOST=0.9538;RSQ=0.9370;SNPSOURCE=LOWCOV;LDAF=0.3410;ERATE=0.0006;THETA=0.0058;AF=0.34;ASN_AF=0.21;AMR_AF=0.35;AFR_AF=0.33;EUR_AF=0.46 GT:DS:GL 0|1:1.000:-0.18,-0.47,-2.40
X 63621 rs189671919 G A 100 PASS AC=273;AN=2184;RSQ=0.5045;ERATE=0.0093;LDAF=0.1881;VT=SNP;AA=.;THETA=0.0076;SNPSOURCE=LOWCOV;AVGPOST=0.8028;AF=0.12;ASN_AF=0.10;AMR_AF=0.14;AFR_AF=0.09;EUR_AF=0.16 GT:DS:GL 1|1:1.650:-3.22,-0.47,-0.18
X 85928 rs145862927 A T 100 PASS AVGPOST=0.9752;LDAF=0.0251;AN=2184;VT=SNP;AA=.;AC=31;ERATE=0.0010;SNPSOURCE=LOWCOV;THETA=0.0113;RSQ=0.5623;AF=0.01;ASN_AF=0.01;AMR_AF=0.01;AFR_AF=0.0020;EUR_AF=0.03 GT:DS:GL 1|0:1.000:-2.28,-0.01,-1.55

So we have phased genotypes of type '0|0', '1|0', '0|1' and '1|1'. How is this possible if a male individual has only one X chromosome? We are given phased alleles, so where does the second allele (right side of '|') go if it is present ('1')?

I found some notes about pseudoautosomal regions, but I do not fully understand them. Does it mean that the second variant is on the male's Y chromosome?

Google "Pseudoautosomal region".

I can understand that there is crossing over between X and Y in the pseudoautosomal regions. What I don't get is why the VCF files list two alleles on the X chromosome of a male individual, when a male has only one X chromosome.

There are males with two X chromosomes and one Y.

This case (two alleles on the X chromosome of a male individual) applies to all males in the VCF files from 1000 Genomes. I don't think that all of them have this disorder.

8.1 years ago
Salt Lake City, UT, USA
The loci in your example are within PAR1 (pseudoautosomal region 1). If a male is heterozygous then it must be the case that one allele is on X and the other allele is on Y. I suspect that if you check other regions of X in males you will find very little evidence of heterozygosity.


The Y chromosome in this assembly contains two pseudoautosomal regions (PARs) that were taken from the corresponding regions in the X chromosome and are exact duplicates:

chrY:10001-2649520 and chrY:59034050-59363566
chrX:60001-2699520 and chrX:154931044-155260560
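To make the point concrete, here is a minimal Python sketch that checks whether an X position falls inside either PAR interval listed above and whether a genotype is heterozygous. The helper names `par_region` and `is_het` are made up for illustration, and the intervals are assumed to be 1-based inclusive GRCh37/hg19 coordinates, matching the VCF in the question.

```python
# PAR intervals on chrX, 1-based inclusive, copied from the list above
# (assumed to be GRCh37/hg19 coordinates, as in the question's VCF).
PAR_X = {
    "PAR1": (60001, 2699520),
    "PAR2": (154931044, 155260560),
}


def par_region(pos):
    """Return 'PAR1' or 'PAR2' if a chrX position lies in a PAR, else None."""
    for name, (start, end) in PAR_X.items():
        if start <= pos <= end:
            return name
    return None


def is_het(gt):
    """True if a diploid GT string such as '0|1' or '1/0' has two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles[0] != alleles[1]


# Positions and HG00096 genotypes taken from the VCF lines in the question.
examples = [(60034, "0|0"), (60052, "0|1"), (63621, "1|1"), (85928, "1|0")]
for pos, gt in examples:
    print(pos, gt, par_region(pos), "het" if is_het(gt) else "hom")
```

Running the same check over all X records for a male sample and counting heterozygous calls inside versus outside the PARs is one way to test the expectation that heterozygosity is rare outside them.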

8.1 years ago
United States
Let's consider another case: the two homologous copies of chromosome 1. With standard technology, you can see that a site is heterozygous, but you do not know for sure which homologous chromosome each allele belongs to. In the PARs, X and Y behave almost exactly like two homologous autosomes. One of the alleles you see is from chrX and the other from chrY, but you do not know which allele is on X and which is on Y.

Btw, PAR is the key reason why we should NOT use the UCSC genomes for mapping. UCSC puts identical PAR on both X and Y. When you do read mapping, reads will be randomly distributed between two identical copies with mapQ=0. The end result is you will get no variants from PAR with the current pipelines. Most of us do not care about PAR, but we should try to use the better strategy when possible.

Thanks for this explanation. I am interested in PAR variants--which genome assembly should I be using? Does it solve the problem by omitting PARs from one chromosome?

There are two versions: human_g1k_v37.fasta.gz and phase2_reference_assembly_sequence. The latter contains extra pieces missing from GRCh37.
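How a given reference treats the PARs on Y is not spelled out here, but it can be checked directly. The sketch below measures the fraction of N bases in the Y PAR intervals quoted earlier in the thread. The function names are hypothetical, the contig name "Y" is an assumption (some references use "chrY"), and `report` assumes a gzip-compressed FASTA at a path you supply.

```python
import gzip

# PAR intervals on Y, 1-based inclusive, from the coordinates quoted earlier
# in the thread (assumed GRCh37/hg19).
PAR_Y = [(10001, 2649520), (59034050, 59363566)]


def read_contig(lines, name):
    """Return the sequence of the FASTA record whose header's first word is `name`."""
    chunks, inside = [], False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if inside:
                break  # reached the record after the one we wanted
            header = line[1:].split()
            inside = bool(header) and header[0] == name
        elif inside:
            chunks.append(line)
    return "".join(chunks)


def n_fraction(seq, start, end):
    """Fraction of N/n bases in the 1-based inclusive interval [start, end]."""
    region = seq[start - 1:end]
    if not region:
        return 0.0
    return region.upper().count("N") / len(region)


def report(path, contig="Y"):
    """Print the N fraction of each Y PAR interval for a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        seq = read_contig(handle, contig)
    for start, end in PAR_Y:
        print(contig, start, end, n_fraction(seq, start, end))
```

A fraction close to 1.0 for both intervals would suggest the Y copy of the PARs is masked, so reads from those regions can only align to X. A fraction close to 0.0 would suggest the sequence is present on both chromosomes.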

