Question

Discrepancies between 1000 Genomes Phase 1 vs. Phase 3 allele frequencies

9.9 years ago

Greg P ▴ 70

I would like to understand why Phases 1 and 3 of the 1000 Genomes data report very different allele frequencies for certain SNPs.

I have been comparing SNP allele frequencies among a certain group of individuals ("cases") with the allele frequencies reported by the 1000 Genomes project (specifically the EUR super population). For a small group of about 20 SNPs I have found extreme differences between these two allele frequencies. An example was the SNP rs533515.

However, the allele frequency differences were so extreme that I was suspicious. Looking a bit further, I noticed that this SNP has vastly different allele frequencies reported in the different Phases of the 1000 Genomes data. Initially I had been working with Phase 3, assuming it was more up to date and therefore "better." For rs533515 in particular, I can find the Phase 3 allele frequency as follows:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20130502/ALL.chr11.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

with the result

A    C    AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||

Note that the European allele frequency given by Phase 3 is very small, 0.001.

Now, I can get the same information from Phase 1:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

and the result is

A    C    AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87

Here, the European allele frequency is 0.87. This is in fact much closer to the allele frequency among my "cases." I found this to be the case for all the SNPs for which my "case" frequencies differed wildly from the Phase 3 allele frequencies.
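As a sanity check, the two INFO strings above can be compared directly. This is a minimal Python sketch, not part of the original commands: the helper name `parse_info` is hypothetical, and the INFO strings are simply copied from the two tabix results. It recomputes AF as AC/AN for each phase and prints the EUR_AF values side by side.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'AC=12;AN=5008' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag-style entries have no '='; store them as True.
        fields[key] = value if sep else True
    return fields


# INFO values copied verbatim from the tabix output for rs533515.
phase3 = parse_info(
    "AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;"
    "AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||"
)
phase1 = parse_info(
    "AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;"
    "VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;"
    "AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87"
)

for label, rec in (("Phase 1", phase1), ("Phase 3", phase3)):
    # AF over all samples should equal alternate allele count / total alleles.
    af = int(rec["AC"]) / int(rec["AN"])
    print(label, "AC/AN =", round(af, 5), "reported AF =", rec["AF"],
          "EUR_AF =", rec["EUR_AF"])
```

In both phases the reported AF agrees with AC/AN, which suggests the discrepancy comes from the underlying genotype calls and not from how the frequency was derived.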

What is the reason for this large difference between the allele frequencies according to Phases 1 and 3 of 1000 Genomes? Should I be using only Phase 1 data at this point (this is what the NCBI 1000 Genomes Browser does)?

snp • 5.0k views

Since I added my comment, I have been seeing more of these issues, which is very concerning because I spent a lot of time adding the Phase 3 VCFs to my annotation pipeline. Here is another locus that is missing from the Phase 3 calls but has a high allele frequency in Phase 1.

Phase 1:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...
17    21319121    rs1714864    C    T    100    PASS    AA=C;AC=1091;AF=0.50;AFR_AF=0.50;AMR_AF=0.50;AN=2184;ASN_AF=0.50;AVGPOST=0.9988;ERATE=0.0004;EUR_AF=0.50;LDAF=0.4995;RSQ=0.3613;SNPSOURCE=EXOME;THETA=0.0002;VT=SNP

Phase 3:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...

No SNP is returned at this position.

written 9.9 years ago by Vivek ★ 2.7k

Ram · Answer 1 · 2014-11-11

9.9 years ago

Vivek ★ 2.7k

The only difference I can see is that an indel is called in the vicinity of this SNP in the Phase 3 call set, with a substantial allele count, which might affect the allele counts for the SNP.

11    64497185    .    CAAAA    AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C    100    PASS    AC=257,44,570,3,73,20;AF=0.0513179,0.00878594,0.113818,0.000599042,0.0145767,0.00399361;AN=5008;NS=2504;DP=14380;EAS_AF=0.0179,0,0.1915,0.001,0,0;AMR_AF=0.013,0,0.2651,0,0,0.0101;AFR_AF=0.0673,0.0189,0.0862,0.0015,0.0552,0.0023;EUR_AF=0.0149,0,0.0596,0,0,0.008;SAS_AF=0.1288,0.0194,0.0194,0,0,0.002
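To make the multi-allelic record above easier to read, here is a small Python sketch (the variable names are my own, and the values are copied from the record) that pairs each ALT allele with its allele count and EUR frequency. The length-based label is a deliberate simplification and does not distinguish complex alleles.

```python
ref = "CAAAA"
alts = "AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C".split(",")
ac = [int(x) for x in "257,44,570,3,73,20".split(",")]
eur_af = [float(x) for x in "0.0149,0,0.0596,0,0,0.008".split(",")]

for alt, count, freq in zip(alts, ac, eur_af):
    # Label each allele only by its length relative to REF (a simplification).
    if len(alt) == len(ref):
        kind = "same length"
    elif len(alt) > len(ref):
        kind = "longer"
    else:
        kind = "shorter"
    print(f"{ref}>{alt}\t{kind}\tAC={count}\tEUR_AF={freq}")

# Combined EUR frequency of all non-reference alleles at this site.
total_eur = sum(eur_af)
print("sum of EUR_AF over ALT alleles:", round(total_eur, 4))
```

Summed over all six ALT alleles, the EUR frequency at this site is about 0.08, so this record alone may not account for the full gap to the Phase 1 EUR_AF of 0.87.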

Can you elaborate? Is it possible for there to be an ambiguity between a SNP and an indel? Or perhaps between several simultaneous variants and an indel, or something of that kind?

written 9.9 years ago by Greg P ▴ 70

Depending on how alleles are counted over the population, there is potential for ambiguity here. What appears to have been counted as an A>C change in the Phase 1 call set might be counted as a deletion of consecutive As in the Phase 3 calls.

The reference sequence in this region:

>11:64497185-64497198
CAAAAAAAAAAAAC
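One way to look at this is to apply each allele to the reference window and compare the resulting haplotypes. The following is a rough Python sketch under simple assumptions: `apply_allele` is a hypothetical helper, positions are treated as 1-based, and the sequences and alleles are copied from this thread. It does not model read alignment or the variant caller.

```python
REGION_START = 64497185
reference = "CAAAAAAAAAAAAC"  # 11:64497185-64497198, copied from above


def apply_allele(seq, seq_start, pos, ref, alt):
    """Replace `ref` at 1-based genomic position `pos` with `alt`."""
    offset = pos - seq_start
    if seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return seq[:offset] + alt + seq[offset + len(ref):]


# Phase 1 SNP rs533515: A>C at 11:64497189.
snp_hap = apply_allele(reference, REGION_START, 64497189, "A", "C")

# Phase 3 multi-allelic record at 11:64497185 (REF CAAAA).
indel_haps = {
    alt: apply_allele(reference, REGION_START, 64497185, "CAAAA", alt)
    for alt in "AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C".split(",")
}

print(f"{'reference':<14}", reference)
print(f"{'SNP A>C':<14}", snp_hap)
for alt, hap in indel_haps.items():
    print(f"{'CAAAA>' + alt:<14}", hap)
```

Under this simple model, none of the Phase 3 ALT haplotypes is identical to the SNP haplotype. If the two call sets do describe the same reads differently, the cause would have to lie in alignment or calling within the homopolymer run, which this sketch does not capture.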

written 9.9 years ago by Vivek ★ 2.7k

score 0 · Answer 2 · 2014-11-11

AFAIK, the genotypes have been classified into categories differently in Phase 3 than in Phase 1. Are we sure the subpopulations within EUR have not changed between the phases?

EDIT: I dug a bit deeper, and it seems I was mistaken. While the total number of alleles (AN) has increased from 2184 in Phase 1 to 5008 in Phase 3, I don't think the basis for classification has changed. An increase in the number of alleles alone should not produce such a large change in AF values. Let's see what others have to say about this.