Question: 1000 Genomes Project Reports Invalid Alleles for SNPs?
14 months ago by
SpacemanSpiffo10 wrote:

I am attempting to use a SNP genotyping data set to infer ancestry with ADMIXTURE, and I want to use the 1000 Genomes Project (1KG) Phase 3 data as background allele frequencies for populations. However, the alleles called in my data do not match those reported by 1KG. One potential issue is that the MAP file provided in 1KG's repository uses chromosome number and base position in place of an ID; I have been using these to match SNPs to my dataset.

To give an example:

The SNP annotated in 1KG's MAP file with the ID 1:100612675, chr1 pos100612675, is reported as being GG for sample HG02922 in the PED file. This SNP locus maps to RSID rs499479, which in my own dataset is called as having T and C alleles for each sample.

dbSNP reports that the two possible alleles for this SNP are C and T, which implies that the SNP is called correctly in my dataset but not in 1KG. It seems more likely, however, that there is some other explanation.

If anyone is able to help out, it'd be appreciated.

Thanks in advance.

14 months ago by
Emily_Ensembl19k (EMBL-EBI) wrote:

According to dbSNP, C/T are the alleles on the reverse strand. All the 1000 Genomes alleles are reported on the forward strand, where the same SNP reads G/A (the complements of C/T), making GG homozygous reference. Please check the strand on the dataset you're using.
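To make the strand point concrete, here is a minimal Python sketch of how an allele call on one strand translates to the other. The helper name `flip_strand` is hypothetical and not part of any tool mentioned in this thread; it only handles single-base alleles.

```python
# Complement of each base; a single-base SNP call on the opposite strand
# is just the complement of the reported allele.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_strand(alleles):
    """Return the same allele calls as read on the opposite strand.

    Hypothetical helper for illustration; assumes single-base alleles only.
    """
    return [COMPLEMENT[allele] for allele in alleles]


# rs499479 as reported on the reverse strand (C/T), read on the forward strand
reverse_alleles = ["C", "T"]
forward_alleles = flip_strand(reverse_alleles)
print(forward_alleles)  # ['G', 'A']
```

Under this reading, a CC call in a reverse-strand dataset corresponds to the GG call seen in 1KG.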


Hi Emily, thank you for responding. Having looked further into what you said, it does appear that a strand mismatch is my problem. The files I have, which were exported directly from GenomeStudio, report "Forward Strand" alleles for each sample; however, it seems that this name is misleading. The export instead uses whichever strand dbSNP uses to report the reference allele. This means that for SNPs where dbSNP reports the reference allele on the reverse strand, my data do not match the 1KG data, which always uses the forward strand.

Do you happen to know if there is a way I can get a full list of the SNPs for which the reverse strand is used to report the reference allele? Checking this manually for 500,000 SNPs will be difficult.

Thanks again.
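One possible way to build such a list without manual checking is to compare each SNP's allele set in the array data against its forward-strand alleles in 1KG. The sketch below is illustrative only: `classify_strand` and the input dictionary are hypothetical, the 1KG A allele for 1:100612675 is inferred as the complement of T, and the two `example:` entries are made-up placeholders. Note that A/T and C/G SNPs are their own complement, so alleles alone cannot reveal their strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_strand(array_alleles, forward_alleles):
    """Classify one SNP by comparing array alleles with forward-strand alleles.

    Returns 'ambiguous', 'match', 'flip' or 'mismatch'.
    Hypothetical helper; assumes single-base alleles only.
    """
    array_set = set(array_alleles)
    forward_set = set(forward_alleles)
    # A/T and C/G SNPs look identical on both strands
    if forward_set == {COMPLEMENT[a] for a in forward_set}:
        return "ambiguous"
    if array_set <= forward_set:
        return "match"
    # The array alleles agree only after complementing: reported on the other strand
    if {COMPLEMENT[a] for a in array_set} <= forward_set:
        return "flip"
    return "mismatch"


# SNP ID -> (alleles in the array export, forward-strand alleles in 1KG)
snps = {
    "1:100612675": (("T", "C"), ("G", "A")),  # example from this thread; A is inferred
    "example:1": (("A", "G"), ("A", "G")),    # made-up placeholder
    "example:2": (("A", "T"), ("A", "T")),    # made-up placeholder
}

to_flip = [
    snp_id
    for snp_id, (array_alleles, forward_alleles) in snps.items()
    if classify_strand(array_alleles, forward_alleles) == "flip"
]
print(to_flip)  # ['1:100612675']
```

A list like this could then be handed to a tool that flips strands (PLINK's `--flip` option takes a file of SNP IDs, for example). The ambiguous SNPs would need another source of strand information, such as the array manifest, or could be excluded.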

written 14 months ago by SpacemanSpiffo10