Question: removing duplicate SNPs (same position) with lowest call rate
5.0 years ago, jani.p.heikkinen wrote:

I am trying to solve a problem with my genotyped array data set. For one reason or another, the data set contains SNPs listed under two or three different names that point to the same position. For example:

index SNP pos A1 A2 F_MISS
2046 snp_1 113890304 C T 0
2047 snp_2 113890304 C T 0.000422
2048 snp_3 113890304 C T 0

I want to build a list of SNP names to be removed (so I can exclude them in PLINK).

So from the SNPs above, snp_2 and either snp_1 or snp_3 should be in the removal list.

How would I achieve this?


5.0 years ago, TriS (Buffalo, United States) wrote:

if you just want to remove duplicates in R (not tested):

keep <- !duplicated(mySNPmatrix[, 3])  # column 3 = pos; TRUE only for the first row at each position
mySNPmatrix <- mySNPmatrix[keep, ]

However, the F_MISS values differ between the duplicated rows, and duplicated() keeps whichever row comes first, so sort by F_MISS beforehand if you want to keep the SNP with the lowest missingness.

5.0 years ago, christopher medway (Cardiff, UK) wrote:

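Try this bash one-liner (not tested). It drops the header line, sorts by position and then by F_MISS, and prints the name of every SNP except the first (lowest F_MISS) at each position, which gives the removal list:

tail -n +2 input.txt | sort -k3,3n -k6,6g | awk 'seen[$3]++ {print $2}' > output.txt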
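
For comparison, here is a rough single-pass awk sketch that builds the removal list without sorting. It assumes the whitespace-separated layout shown in the question (SNP name in field 2, position in field 3, F_MISS in field 6) and a single header line; the file names input.txt and remove_snps.txt are placeholders. It keys on position alone because the example has no chromosome column, so a multi-chromosome file would need the chromosome added to the key.

```shell
# Example input copied from the question (header plus three SNPs at one position).
cat > input.txt <<'EOF'
index SNP pos A1 A2 F_MISS
2046 snp_1 113890304 C T 0
2047 snp_2 113890304 C T 0.000422
2048 snp_3 113890304 C T 0
EOF

# For each position, remember the SNP with the lowest F_MISS seen so far
# and print every other SNP at that position (the removal list).
awk '
  NR == 1 { next }                                  # skip the header line
  !($3 in best) { best[$3] = $2; miss[$3] = $6; next }  # first SNP at this position
  ($6 + 0) < (miss[$3] + 0) {                       # strictly better call rate:
    print best[$3]                                  #   the previous best is removed
    best[$3] = $2; miss[$3] = $6
    next
  }
  { print $2 }                                      # otherwise remove the current SNP
' input.txt > remove_snps.txt

cat remove_snps.txt
```

On the example data this keeps snp_1 (the first of the two SNPs with F_MISS 0) and lists snp_2 and snp_3 for removal. The list can then be passed to PLINK's --exclude option, for example plink --bfile mydata --exclude remove_snps.txt --make-bed --out mydata_dedup, where mydata and mydata_dedup are hypothetical file prefixes.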