Question

R programming: compare columns to column and get the mismatch

4

9.3 years ago

MAPK ★ 2.1k

Hi Guys,

I have a dataframe (df1) with more than 1000 columns. I would like to compare each pair of successive columns, starting from the first column (1GT), and check whether the contents match. For example, I want to compare column 1GT with column 1XGT and get a concordance column in the result with a match or mismatch decision. Thank you.

df1:

1GT   1XGT   2GT    2XGT
0/0   0/1    zero   zero
0/1   0/1    one    zero

Result:

1GT   1XGT   concordance   2GT    2XGT   concordance
0/0   0/1    mismatch      zero   zero   match
0/1   0/1    match         one    zero   mismatch

R • 44k views

updated 19 months ago by Ram 44k • written 9.3 years ago by MAPK ★ 2.1k

Ram · Answer 1 · 2015-05-08

3

9.3 years ago

lkmklsmn ▴ 970

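R code:

match(df1$`1GT`, df1$`1XGT`)

This gives, for each entry of 1GT, the position of its first occurrence in 1XGT (the backticks are needed because the column names start with a digit). Note that this looks values up across the whole column; it is not a row-by-row comparison.

If you want a vector with "concordant"/"discordant" for each row, you can generate a vector of all "discordant" and then mark the rows where the two columns are equal:

tmp <- rep("discordant", nrow(df1))
tmp[which(df1$`1GT` == df1$`1XGT`)] <- "concordant"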
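For a row-by-row decision, an element-wise comparison with `==` may be closer to what the question asks for. The following is only a minimal sketch on the two example rows from the question; it assumes the columns are character vectors (not factors) and that the data frame was built with `check.names = FALSE` so the names `1GT` and `1XGT` are kept as they are.

```r
# Example rows from the question; check.names = FALSE keeps names like "1GT"
df1 <- data.frame(
  `1GT`  = c("0/0", "0/1"),
  `1XGT` = c("0/1", "0/1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# match() looks each 1GT value up anywhere in 1XGT (first position found)
lookup <- match(df1$`1GT`, df1$`1XGT`)
lookup
# [1] NA  1

# Element-wise comparison: row i of 1GT against row i of 1XGT
concordance <- ifelse(df1$`1GT` == df1$`1XGT`, "match", "mismatch")
concordance
# [1] "mismatch" "match"
```

Here `match()` reports that "0/1" is found at position 1 of 1XGT, even though row 1 itself is a mismatch, which is why the element-wise comparison is used for the per-row label.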

updated 19 months ago by Ram 44k • written 9.3 years ago by lkmklsmn ▴ 970

0

Thank you, what if I have thousands of columns and want to match two columns at a time across the table?

written 9.3 years ago by MAPK ★ 2.1k

Ram · Answer 2 · 2015-05-08

1

9.3 years ago

zx8754 12k

Something like the code below. The idea is to compare the odd columns with the even columns, which is what seq is doing; then use colSums to count the TRUE values for each pair, and divide by the number of rows (that is, the number of samples):

#dummy data
df <- data.frame(
  snp1=c(1,1,1,2,2),
  snp1a=c(1,0,1,2,2),
  snp2=c(1,1,1,2,2),
  snp2a=c(1,1,1,2,2),
  snp3=c(1,2,1,2,2),
  snp3a=c(1,2,2,2,2))

df
#   snp1 snp1a snp2 snp2a snp3 snp3a
# 1    1     1    1     1    1     1
# 2    1     0    1     1    2     2
# 3    1     1    1     1    1     2
# 4    2     2    2     2    2     2
# 5    2     2    2     2    2     2

#concordance
colSums(df[,seq(1,ncol(df),2)]==df[,seq(2,ncol(df),2)])/nrow(df)
# snp1 snp2 snp3 
# 0.8  1.0  0.8
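If a per-row match/mismatch column next to each pair is wanted, as in the layout shown in the question, the same odd/even indexing can be reused. This is a rough sketch rather than a definitive solution: it assumes an even number of columns ordered as consecutive pairs, and the `_concordance` suffix and the helper names `odd`, `even`, `same`, `pairs` and `result` are made up for illustration.

```r
# Same dummy data as above
df <- data.frame(
  snp1  = c(1, 1, 1, 2, 2),
  snp1a = c(1, 0, 1, 2, 2),
  snp2  = c(1, 1, 1, 2, 2),
  snp2a = c(1, 1, 1, 2, 2),
  snp3  = c(1, 2, 1, 2, 2),
  snp3a = c(1, 2, 2, 2, 2))

odd  <- seq(1, ncol(df), by = 2)
even <- seq(2, ncol(df), by = 2)

# Logical matrix: TRUE where the paired columns agree in a given row
same <- df[, odd, drop = FALSE] == df[, even, drop = FALSE]

# For each pair, keep both original columns and append a label column
pairs <- lapply(seq_along(odd), function(i) {
  out <- df[, c(odd[i], even[i])]
  label <- paste0(names(df)[odd[i]], "_concordance")
  out[[label]] <- ifelse(same[, i], "match", "mismatch")
  out
})

# Bind the per-pair blocks back together, column-wise
result <- do.call(cbind, pairs)
result
```

The result has nine columns, with `snp1_concordance`, `snp2_concordance` and `snp3_concordance` placed directly after their pairs. With thousands of columns the `lapply` loop still only runs once per pair, since the comparison itself is done in one vectorised step.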

updated 19 months ago by Ram 44k • written 9.3 years ago by zx8754 12k

0

Thank you very much.

written 9.3 years ago by MAPK ★ 2.1k

0

What is this output that you get from your example?

# snp1 snp2 snp3
# 0.8  1.0  0.8

written 9.3 years ago by MAPK ★ 2.1k

1

It is the overall concordance across all samples: 0.8 for snp1 means that 80% of samples have the same call for snp1 and snp1a.
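To make the arithmetic concrete, here is a small sketch for the snp1 pair alone, using the values from the dummy data; the variable names `agree` and `concordance_snp1` are just for illustration.

```r
snp1  <- c(1, 1, 1, 2, 2)
snp1a <- c(1, 0, 1, 2, 2)

# TRUE where the two calls agree for a sample
agree <- snp1 == snp1a
agree
# [1]  TRUE FALSE  TRUE  TRUE  TRUE

# 4 agreeing samples out of 5
concordance_snp1 <- sum(agree) / length(agree)
concordance_snp1
# [1] 0.8
```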

written 9.3 years ago by zx8754 12k

0

Thanks!

updated 19 months ago by Ram 44k • written 9.3 years ago by MAPK ★ 2.1k