R programming: compare columns to column and get the mismatch
9.0 years ago
MAPK ★ 2.1k

Hi Guys,

I have a data frame (df1) with more than 1000 columns. I would like to compare each pair of successive columns, starting from the first column (1GT), and check whether the contents match. For example, I want to compare column 1GT with column 1XGT and get a concordance column in the result with a match or mismatch decision. Thank you.

df1:

1GT   1XGT   2GT    2XGT
0/0   0/1    zero   zero
0/1   0/1    one    zero

Result:

1GT   1XGT   concordance   2GT    2XGT   concordance
0/0   0/1    mismatch      zero   zero   match
0/1   0/1    match         one    zero   mismatch
R • 43k views
9.0 years ago
lkmklsmn ▴ 970

R code:

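match(df$`1GT`, df$`1XGT`)

This gives, for each entry of 1GT, the position of its first occurrence anywhere in 1XGT (column names that start with a digit need backticks). Note that this is a lookup, not a row-by-row comparison.

If you want a per-row vector with "concordant"/"discordant", you can generate a vector of all "discordant" and then overwrite the rows where the two columns are equal:

tmp <- rep("discordant", nrow(df))
tmp[which(df$`1GT` == df$`1XGT`)] <- "concordant"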
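
To illustrate the difference between match() and an element-wise comparison, here is a minimal sketch on the two example rows from the question. The data frame is rebuilt by hand, and the variable names pos, same and concordance are only illustrative; check.names = FALSE is assumed so that the column names starting with a digit are kept as-is.

```r
# Example rows from the question; check.names = FALSE keeps names like "1GT"
df <- data.frame(
  `1GT`  = c("0/0", "0/1"),
  `1XGT` = c("0/1", "0/1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# match(): position of the first occurrence of each 1GT value within 1XGT.
# "0/0" never occurs in 1XGT, "0/1" first occurs at position 1.
pos <- match(df$`1GT`, df$`1XGT`)   # NA 1

# ==: element-wise comparison, row 1 with row 1, row 2 with row 2
same <- df$`1GT` == df$`1XGT`       # FALSE TRUE

# Turn the logical vector into the labels asked for in the question
concordance <- ifelse(same, "match", "mismatch")   # "mismatch" "match"
```

The row-by-row labels come from ==, while match() only reports where a value can be found in the other column.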

Thank you, what if I have thousands of columns and want to match two columns at a time across the table?

9.0 years ago
zx8754 11k

Something like below. The idea is to compare the odd columns with the even columns (that is what seq is doing), then use colSums to count the TRUE values for each pair, then divide by the number of rows, i.e. the number of samples:

#dummy data
df <- data.frame(
  snp1=c(1,1,1,2,2),
  snp1a=c(1,0,1,2,2),
  snp2=c(1,1,1,2,2),
  snp2a=c(1,1,1,2,2),
  snp3=c(1,2,1,2,2),
  snp3a=c(1,2,2,2,2))

df
#   snp1 snp1a snp2 snp2a snp3 snp3a
# 1    1     1    1     1    1     1
# 2    1     0    1     1    2     2
# 3    1     1    1     1    1     2
# 4    2     2    2     2    2     2
# 5    2     2    2     2    2     2

#concordance
colSums(df[,seq(1,ncol(df),2)]==df[,seq(2,ncol(df),2)])/nrow(df)
# snp1 snp2 snp3 
# 0.8  1.0  0.8
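
If the per-row match/mismatch columns from the question are wanted instead of a single concordance rate per pair, the same odd/even indexing can be reused. The following is only a sketch under some assumptions: the data frame is the two-row example from the question, columns come strictly in pairs, and a missing value (NA) on either side is counted as a mismatch. The names odd, even, pieces and result are illustrative.

```r
# Example data from the question; check.names = FALSE keeps names like "1GT"
df1 <- data.frame(
  `1GT`  = c("0/0", "0/1"),
  `1XGT` = c("0/1", "0/1"),
  `2GT`  = c("zero", "one"),
  `2XGT` = c("zero", "zero"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

odd  <- seq(1, ncol(df1), by = 2)   # 1, 3, ...
even <- seq(2, ncol(df1), by = 2)   # 2, 4, ...

# For each pair, keep both columns and append a per-row concordance column
pieces <- lapply(seq_along(odd), function(i) {
  a <- df1[[odd[i]]]
  b <- df1[[even[i]]]
  out <- df1[, c(odd[i], even[i]), drop = FALSE]
  # Assumption: NA on either side is treated as a mismatch
  out$concordance <- ifelse(!is.na(a) & !is.na(b) & a == b, "match", "mismatch")
  out
})

# Put the pieces back side by side in the original column order
result <- do.call(cbind, pieces)
result
```

Every pair ends up with a column called concordance, as in the layout from the question, so the result has duplicated column names; select those columns by position, or rename them per pair if they need to be addressed by name.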

Thank you very much.


What is this you get from your example?

snp1 snp2 snp3 
# 0.8  1.0  0.8 

It is the overall concordance across all samples. 0.8 for snp1 means that 80% of the samples have the same call for snp1 and snp1a.
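
The 0.8 can be checked by hand for a single pair. A small sketch using the snp1 and snp1a values from the dummy data (the name agree is only illustrative):

```r
snp1  <- c(1, 1, 1, 2, 2)
snp1a <- c(1, 0, 1, 2, 2)

agree <- snp1 == snp1a   # TRUE FALSE TRUE TRUE TRUE
n_agree <- sum(agree)    # 4 samples agree
rate <- mean(agree)      # 4 / 5 = 0.8
```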


Thanks!
