Question

Merging sets of autosomal SNPs from the same array, but I am losing ~50,000 SNPs. Ideas what is going on?

0

Entering edit mode

10.0 years ago

devenvyas ▴ 770

I have a data set of 608,038 autosomal SNPs from the Affymetrix Human Origins array, which has 620,744 autosomal SNPs total. These were filtered in Genotyping Console, such that they are all autosomal SNPs with 95% SNP call-rates from 93+2 samples (+2 = positive control DNA samples) (and exclude ~4000 possibly triallelic sites).

I have a downloaded a dataset with 594,924 autosomal SNPs generated using the same type of array (http://genetics.med.harvard.edu/reich/Reich_Lab/Datasets_files/EuropeFullyPublic.tar.gz).

When I merge the data sets, the total number of SNPs fall to 545,956.

This happens when I merge the data sets in Plink as well as when I look at the intersection of rs/Affx-#s from both data sets in R

first <- read.table("first.map", header=F)
second <- read.table("second.map", header=F)  
intersect(first$V2,second$V2) -> consensus

I was wondering if anyone has any idea where these ~50,000 SNPs are going.

This is still plenty of SNPs for human evolutionary/population genetic inferences, but I don't want to be asked one day by an article reviewer why I lack this many sites from data merger.

Thanks!

SNP • 2.3k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.0 years ago by devenvyas ▴ 770

Ram · Answer 1 · 2015-07-27

0

Entering edit mode

10.0 years ago

rbagnall ★ 1.8k

There are many duplicated Affx-#s making up ~65,000 rows.

Original length = 693,587

(cut -f 2 Axiom_GW_HuOrigin.na35.af_supp.txt | wc -l )

Unique rows only = 628,082

(cut -f 2 Axiom_GW_HuOrigin.na35.af_supp.txt | sort | uniq | wc -l )

R's intersect will discard any duplicated values

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by rbagnall ★ 1.8k

0

Entering edit mode

I don't think that is quite it.

See page #16 of this. There are 629,443 SNPs properly on the array with 633,625 probes for those SNPs. The ~4000 SNPs with multiple probes were excluded from my SNP list as they are thought to be triallelic. I believe the remaining from the 693,587 are for Dish QC and other quality control things.

I have checked both my data and the downloaded data for duplicated Affx-#s as well as rs #s (the comparative data has Affx-#s only, so I ran a script on it to substitution Affx to rs when rs is available). With and without said script, there are no duplicates in either data set.

One question that has come to mind: do SNPs ever see their Affx-#s change? The latest annotation file is fairly recent (from March), and the data set was published well before.

Could the discrepancy be because the Affx-#s changed between the generation of their data and the generation of my data?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by devenvyas ▴ 770

0

Entering edit mode

So on the advice of the labmate, instead of loading the Affx numbers into R, I loaded the Hg19 locations into R as two lists in the format of chrom# hyphen bp# (e.g., 16-24417536). This intersection yielded 585,413 sites (compared to 545,956 when intersecting based on Affx numbers).

Now I am wondering, how do I tell Plink to ignore the Affx numbers and merge based upon Hg19 locations?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by devenvyas ▴ 770