Question: Discrepancies between 1000 Genomes Phase 1 vs. Phase 3 allele frequencies
4.9 years ago by
Greg P70
Brooklyn, NY
Greg P70 wrote:

I would like to understand the reason why Phases 1 and 3 of the 1000 Genomes data have very different allele frequencies for certain SNPs.

I have been comparing SNP allele frequencies among a certain group of individuals ("cases") with the allele frequencies reported by the 1000 Genomes project (specifically the EUR super population). For a small group of about 20 SNPs I have found extreme differences between these two allele frequencies. An example was the SNP rs533515.

However, the allele frequency differences were so extreme that I was suspicious. Looking a bit further, I noticed that this SNP has vastly different allele frequencies reported in the different phases of the 1000 Genomes data. Initially I had been working with Phase 3, assuming it was more up to date and therefore "better." For rs533515 in particular, I can find the Phase 3 allele frequency as follows:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20130502/ALL.chr11.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

with the result

A    C    AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||

Note that the European allele frequency given by Phase 3 is very small, 0.001.
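As a quick sanity check that the reported AF is at least consistent with the allele counts in the same record, AC/AN can be recomputed from the INFO string. This is only a sketch: `af_from_counts` is a hypothetical helper name, and it assumes a biallelic record with a single AC value.

```shell
# af_from_counts INFO: recompute AF as AC/AN from a VCF INFO string.
# Hypothetical helper; assumes a biallelic record (one AC value).
af_from_counts() {
    printf '%s\n' "$1" | awk -F';' '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "AC") ac = kv[2]
            if (kv[1] == "AN") an = kv[2]
        }
        # Skip records without a usable AN to avoid dividing by zero.
        if (an > 0) printf "%.8f\n", ac / an
    }'
}

# INFO string copied from the Phase 3 result above.
info='AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||'
af_from_counts "$info"
```

For this record the recomputed value agrees with the AF field, so the low frequency reflects how few alternate alleles were called in Phase 3 rather than a formatting problem in the file.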

Now, I can get the same information from Phase 1:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

and the result is

A    C    AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87

Here, the European allele frequency is 0.87. This is in fact much closer to the allele frequency among my "cases." I found this to be the case for all the SNPs for which my "case" frequencies differed wildly from the Phase 3 allele frequencies.
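To run the same comparison over all ~20 suspicious SNPs instead of reading the INFO strings by eye, the EUR_AF value can be pulled out of each phase's record and differenced. A minimal sketch follows; `info_value` is a hypothetical helper name, and the two INFO strings are the ones shown above.

```shell
# info_value KEY INFO: print the value of KEY from a semicolon-separated
# VCF INFO string. Hypothetical helper; matches the key name exactly, so
# asking for AF does not return EUR_AF.
info_value() {
    printf '%s\n' "$2" | awk -F';' -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == key) print substr($i, n + 1)
        }
    }'
}

phase3='AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||'
phase1='AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87'

p1=$(info_value EUR_AF "$phase1")
p3=$(info_value EUR_AF "$phase3")

# Report both frequencies and their absolute difference.
awk -v a="$p1" -v b="$p3" 'BEGIN {
    d = a - b; if (d < 0) d = -d
    printf "EUR_AF Phase 1=%s Phase 3=%s |diff|=%.3f\n", a, b, d
}'
```

Any cutoff for calling a difference "suspicious" would be a choice of the analyst; none is implied by the 1000 Genomes files themselves.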

What is the reason for this large difference between the allele frequencies according to Phases 1 and 3 of 1000 Genomes? Should I be using only Phase 1 data at this point (this is what the NCBI 1000 Genomes Browser does)?

modified 4.9 years ago by Vivek2.3k • written 4.9 years ago by Greg P70

Since I added my comment, I'm actually seeing more of these issues, which is very concerning because I spent a lot of time adding the Phase 3 VCFs to my annotation pipeline. Here's another locus that is missing from the Phase 3 calls but has a high allele frequency in Phase 1.

Phase1:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...
17    21319121    rs1714864    C    T    100    PASS    AA=C;AC=1091;AF=0.50;AFR_AF=0.50;AMR_AF=0.50;AN=2184;ASN_AF=0.50;AVGPOST=0.9988;ERATE=0.0004;EUR_AF=0.50;LDAF=0.4995;RSQ=0.3613;SNPSOURCE=EXOME;THETA=0.0002;VT=SNP

Phase3:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...

No record is returned, so this SNP is absent from the Phase 3 sites file.

written 4.9 years ago by Vivek2.3k
4.9 years ago by
Vivek2.3k
Denmark
Vivek2.3k wrote:

The only difference I can see is that an indel is being called in the vicinity of this SNP in the Phase 3 call set, with a substantial allele count, which might affect the allele counts for the SNP.

11    64497185    .    CAAAA    AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C    100    PASS    AC=257,44,570,3,73,20;AF=0.0513179,0.00878594,0.113818,0.000599042,0.0145767,0.00399361;AN=5008;NS=2504;DP=14380;EAS_AF=0.0179,0,0.1915,0.001,0,0;AMR_AF=0.013,0,0.2651,0,0,0.0101;AFR_AF=0.0673,0.0189,0.0862,0.0015,0.0552,0.0023;EUR_AF=0.0149,0,0.0596,0,0,0.008;SAS_AF=0.1288,0.0194,0.0194,0,0,0.002
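The indel record above starts at 64497185 with a 5-base REF allele, so it spans 64497185-64497189 and covers the SNP position. One way to look for this kind of overlap at other loci is to filter VCF data lines by REF span. This is a sketch only: `overlapping_records` is a hypothetical helper name, and the input is assumed to be tab-separated VCF data lines such as those tabix prints.

```shell
# overlapping_records POS: read VCF data lines on stdin and print
# CHROM, POS, REF and ALT for every record whose REF allele spans POS.
# Hypothetical helper; header lines starting with "#" are skipped.
overlapping_records() {
    awk -F'\t' -v target="$1" '
        /^#/ { next }
        {
            start = $2
            end = $2 + length($4) - 1   # last reference base covered by REF
            if (start <= target && target <= end) print $1, $2, $4, $5
        }'
}

# The Phase 3 indel and SNP records from this thread, reduced to five columns.
printf '11\t64497185\t.\tCAAAA\tAAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C\n11\t64497189\trs533515\tA\tC\n' |
    overlapping_records 64497189
```

On real data the input would presumably come from a tabix query over a window a few bases wider than the SNP itself, since a query for the single position may not make nearby records obvious.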
written 4.9 years ago by Vivek2.3k

Can you elaborate? Is it possible for there to be an ambiguity between a SNP and an indel? Or perhaps between several simultaneous variants and an indel, or something of that kind?

written 4.9 years ago by Greg P70

Depending on how they count alleles over the population, there is potential for ambiguity here. What appears to have been counted as an A>C change in the Phase 1 call set might be getting counted as a deletion of consecutive As in the Phase 3 calls.

The reference sequence in this region: 

>11:64497185-64497198
CAAAAAAAAAAAAC
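To see where the two records sit in this sequence, genomic coordinates can be mapped onto the string. The sketch below assumes the first base of the sequence is at 64497185, as the header line indicates; `base_at` is a hypothetical helper name.

```shell
# base_at START SEQ POS: print the base of SEQ at genomic position POS,
# where START is the genomic position of the first base of SEQ.
# Hypothetical helper; prints nothing if POS falls outside SEQ.
base_at() {
    awk -v start="$1" -v seq="$2" -v pos="$3" 'BEGIN {
        i = pos - start + 1
        if (i >= 1 && i <= length(seq)) print substr(seq, i, 1)
    }'
}

seq='CAAAAAAAAAAAAC'
base_at 64497185 "$seq" 64497185   # C: first base of the indel REF allele
base_at 64497185 "$seq" 64497189   # A: the rs533515 position
```

The SNP position falls inside the run of As, four bases after the C where the Phase 3 indel record is anchored, which is the sort of homopolymer context in which the same haplotype can plausibly be written as different variant records.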

written 4.9 years ago by Vivek2.3k
4.9 years ago by
RamRS24k
Houston, TX
RamRS24k wrote:

AFAIK the genotypes have been classified into categories differently in Phase 3 compared to Phase 1. Are we sure the subtypes for EUR have not changed between the phases?

EDIT: I just dug a bit deeper, and it seems I was mistaken. While the number of alleles has increased, I don't think the classification basis has changed. The increase in allele quantity should not result in such a huge change in AF values. Let's see what others have to say about this.

modified 4.9 years ago • written 4.9 years ago by RamRS24k