Question: Merging sets of autosomal SNPs from the same array, but I am losing ~50,000 SNPs. Ideas what is going on?
0
gravatar for devenvyas
3.8 years ago by
devenvyas570
Stony Brook
devenvyas570 wrote:

I have a data set of 608,038 autosomal SNPs from the Affymetrix Human Origins array, which has 620,744 autosomal SNPs total. These were filtered in Genotyping Console, such that they are all autosomal SNPs with 95% SNP call-rates from 93+2 samples (+2 = positive control DNA samples) (and exclude ~4000 possibly triallelic sites).

I have a downloaded a dataset with 594,924 autosomal SNPs generated using the same type of array (http://genetics.med.harvard.edu/reich/Reich_Lab/Datasets_files/EuropeFullyPublic.tar.gz).

When I merge the data sets, the total number of SNPs fall to 545,956.

This happens when I merge the data sets in Plink as well as when I look at the intersection of rs/Affx-#s from both data sets in R

first <- read.table("first.map", header=F)
second <- read.table("second.map", header=F)  
intersect(first$V2,second$V2) -> consensus

I was wondering if anyone has any idea where these ~50,000 SNPs are going.

This is still plenty of SNPs for human evolutionary/population genetic inferences, but I don't want to be asked one day by an article reviewer why I lack this many sites from data merger.

Thanks!

 

 

snp • 971 views
ADD COMMENTlink modified 3.8 years ago by rbagnall1.4k • written 3.8 years ago by devenvyas570
0
gravatar for rbagnall
3.8 years ago by
rbagnall1.4k
Australia
rbagnall1.4k wrote:

There are many duplicated Affx-#s making up ~65,000 rows.

Original length = 693,587

(cut -f 2 Axiom_GW_HuOrigin.na35.af_supp.txt | wc -l )

Unique rows only = 628,082

(cut -f 2 Axiom_GW_HuOrigin.na35.af_supp.txt | sort | uniq | wc -l )

R's intersect will discard any duplicated values

ADD COMMENTlink written 3.8 years ago by rbagnall1.4k

I don't think that is quite it.

See page #16 of this (ftp://ftp.cephb.fr/hgdp_supp10/8_12_2011_Technical_Array_Design_Document.pdf). There are 629,443 SNPs properly on the array with 633,625 probes for those SNPs. The ~4000 SNPs with multiple probes were excluded from my SNP list as they are thought to be triallelic. I believe the remaining from the 693,587 are for Dish QC and other quality control things.

I have checked both my data and the downloaded data for duplicated Affx-#s as well as rs #s (the comparative data has Affx-#s only, so I ran a script on it to substitution Affx to rs when rs is available). With and without said script, there are no duplicates in either data set.

One question that has come to mind: do SNPs ever see their Affx-#s change? The latest annotation file is fairly recent (from March), and the data set was published well before.

Could the discrepancy be because the Affx-#s changed between the generation of their data and the generation of my data?

 

ADD REPLYlink written 3.8 years ago by devenvyas570

So on the advice of the labmate, instead of loading the Affx numbers into R, I loaded the Hg19 locations into R as two lists in the format of chrom# hyphen bp# (e.g., 16-24417536). This intersection yielded 585,413 sites (compared to 545,956 when intersecting based on Affx numbers).

Now I am wondering, how do I tell Plink to ignore the Affx numbers and merge based upon Hg19 locations?

 

ADD REPLYlink written 3.8 years ago by devenvyas570
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2006 users visited in the last hour