Why Are There Two Alleles For Chromosome X In Males (In 1000Genomes Vcf Files)?
12.0 years ago
agnieszka ▴ 110

If you look at variants on the X chromosome in the current VCF files from 1000genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp//release/20110521/ALL.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, there are often two alleles for male individuals. For example, for HG00096, we can find:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
X 60034 . ACC A 256 PASS AVGPOST=0.9664;LDAF=0.0610;THETA=0.0087;ERATE=0.0027;RSQ=0.7797;AC=117;AN=2184;VT=INDEL;AF=0.05;ASN_AF=0.07;AMR_AF=0.05;AFR_AF=0.09;EUR_AF=0.02 GT:DS:GL 0|0:0.150:0,0,0

X 60052 rs186434315 T A 100 PASS AC=752;AN=2184;VT=SNP;AA=.;AVGPOST=0.9538;RSQ=0.9370;SNPSOURCE=LOWCOV;LDAF=0.3410;ERATE=0.0006;THETA=0.0058;AF=0.34;ASN_AF=0.21;AMR_AF=0.35;AFR_AF=0.33;EUR_AF=0.46 GT:DS:GL 0|1:1.000:-0.18,-0.47,-2.40
...
X 63621 rs189671919 G A 100 PASS AC=273;AN=2184;RSQ=0.5045;ERATE=0.0093;LDAF=0.1881;VT=SNP;AA=.;THETA=0.0076;SNPSOURCE=LOWCOV;AVGPOST=0.8028;AF=0.12;ASN_AF=0.10;AMR_AF=0.14;AFR_AF=0.09;EUR_AF=0.16 GT:DS:GL 1|1:1.650:-3.22,-0.47,-0.18
...
X 85928 rs145862927 A T 100 PASS AVGPOST=0.9752;LDAF=0.0251;AN=2184;VT=SNP;AA=.;AC=31;ERATE=0.0010;SNPSOURCE=LOWCOV;THETA=0.0113;RSQ=0.5623;AF=0.01;ASN_AF=0.01;AMR_AF=0.01;AFR_AF=0.0020;EUR_AF=0.03 GT:DS:GL 1|0:1.000:-2.28,-0.01,-1.55

So we have phased genotypes of type '0|0', '1|0', '0|1' and '1|1'. How is this possible if a male individual has only one X chromosome? We are given phased alleles, so where does the second allele (right side of '|') go, if it is present ('1')?

I found some notes about pseudo-autosomal regions, but I do not fully understand that. Does it mean that the second variant is on the male Y chromosome?

vcf 1000genomes variant • 7.2k views

Google "Pseudoautosomal region".


I can understand that there is crossing over between X and Y in the pseudoautosomal regions. What I don't get is why the VCF files show two alleles on the X chromosome of a male individual, when a male has only one X chromosome.


There are males with two X chromosomes and one Y.


This pattern (two alleles on the X chromosome of a male individual) holds for all males in the VCF files from 1000genomes. I don't think that all of them have this disorder.

12.0 years ago
bdemarest ▴ 460

The loci in your example are within PAR1 (pseudoautosomal region 1). If a male is heterozygous then it must be the case that one allele is on X and the other allele is on Y. I suspect that if you check other regions of X in males you will find very little evidence of heterozygosity.

From http://genome.ucsc.edu/cgi-bin/hgGateway:

The Y chromosome in this assembly contains two pseudoautosomal regions (PARs) that were taken from the corresponding regions in the X chromosome and are exact duplicates:

chrY:10001-2649520 and chrY:59034050-59363566
chrX:60001-2699520 and chrX:154931044-155260560
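
To check this for your own positions, here is a minimal Python sketch. It assumes the hg19/GRCh37 chrX PAR coordinates quoted above (1-based, inclusive) and uses the four HG00096 genotypes from the question. The helper names `par_region` and `is_het` are made up for illustration and do not come from any library.

```python
# chrX PAR intervals (1-based, inclusive), as quoted from UCSC above (hg19/GRCh37).
CHRX_PARS = {
    "PAR1": (60001, 2699520),
    "PAR2": (154931044, 155260560),
}

def par_region(pos):
    """Return 'PAR1', 'PAR2', or None for a 1-based chrX position."""
    for name, (start, end) in CHRX_PARS.items():
        if start <= pos <= end:
            return name
    return None

def is_het(gt):
    """True if a diploid GT string such as '0|1' or '1/0' has two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles[0] != alleles[1]

# (position, GT) pairs for HG00096, copied from the records in the question.
records = [(60034, "0|0"), (60052, "0|1"), (63621, "1|1"), (85928, "1|0")]
for pos, gt in records:
    label = "het" if is_het(gt) else "hom"
    print(pos, gt, par_region(pos), label)
```

All four positions fall inside PAR1, so diploid genotypes are expected there. Running the same check on male genotypes outside the two intervals is one way to test the suspicion that heterozygous calls are rare there.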

12.0 years ago
lh3 33k

Let's consider another case: the two homologous copies of chromosome 1. With standard technology, you can see that a site is heterozygous, but you do not know for sure which homologous chromosome each allele belongs to. In the PARs, X and Y behave almost exactly like two homologous autosomes. One of the alleles you see is from chrX and the other from chrY, but you do not know which allele is on X and which is on Y.

Btw, PAR is the key reason why we should NOT use the UCSC genomes for mapping. UCSC puts identical PARs on both X and Y. When you do read mapping, reads from the PARs will be randomly distributed between the two identical copies with mapQ=0. The end result is that you will get no variants from the PARs with the current pipelines. Most of us do not care about PAR, but we should try to use the better strategy when possible.
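
One way to avoid the duplicate copies, sketched here as an assumption and not as a description of any particular pipeline, is to hard-mask the chrY PAR intervals with N so that PAR reads can only map to chrX. I believe the 1000 Genomes reference handles PAR along these lines, but check its README before relying on that. The function `mask_intervals` is a made-up helper, and the sequence is a toy string standing in for a chromosome.

```python
# chrY PAR intervals (1-based, inclusive), as quoted from UCSC elsewhere in this thread.
CHRY_PARS = [(10001, 2649520), (59034050, 59363566)]

def mask_intervals(seq, intervals):
    """Return seq with every 1-based inclusive interval replaced by 'N'."""
    for start, end in intervals:
        start0 = max(start - 1, 0)   # convert to 0-based, clamp to sequence start
        end0 = min(end, len(seq))    # clamp to sequence end
        if start0 < end0:
            seq = seq[:start0] + "N" * (end0 - start0) + seq[end0:]
    return seq

# Toy example: mask positions 3..5 of a 10-base sequence.
toy = "ACGTACGTAC"
print(mask_intervals(toy, [(3, 5)]))  # ACNNNCGTAC
```

For a real chrY you would load the sequence from the FASTA and pass `CHRY_PARS`. Repeated string slicing is fine for a toy but slow at chromosome scale, where a `bytearray` would be the better choice.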


Thanks for this explanation. I am interested in PAR variants. Which genome assembly should I be using? Does it solve the problem by omitting the PARs from one chromosome?


ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/

There are two versions: human_g1k_v37.fasta.gz and phase2_reference_assembly_sequence. The latter contains extra pieces missing from GRCh37.

