Question: R programming: compare columns to column and get the mismatch
MAPK wrote (5.5 years ago):

Hi Guys,

I have a dataframe (df1) with more than 1000 columns. I would like to compare each pair of successive columns, starting from the first column (1GT), and check whether their contents match. For example, I want to compare column 1GT with column 1XGT and get a concordance column in the result with a match or mismatch decision. Thank you.

 df1:

1GT 1XGT 2GT 2XGT
0/0 0/1 zero zero
0/1 0/1 one zero

 

Result:

1GT 1XGT concordance 2GT 2XGT concordance
0/0 0/1 mismatch zero zero match
0/1 0/1 match one zero mismatch

 

lkmklsmn wrote (5.5 years ago):
R code:

match(df$`1GT`, df$`1XGT`)

This gives, for each entry of 1GT, the index of its first occurrence in 1XGT (column names that start with a digit need backticks). That is a lookup rather than a row-by-row comparison, so if you want a vector with "concordant"/"discordant" for each row, compare the two columns element-wise instead:

tmp <- ifelse(df$`1GT` == df$`1XGT`, "concordant", "discordant")

Thank you, what if I have thousands of columns and want to match two columns at a time across the table?

Reply written 5.5 years ago by MAPK
zx8754 wrote (5.5 years ago):

Something like the code below. The idea is to compare the odd columns with the even columns (that is what seq is doing), use colSums to count the TRUE values for each pair, and then divide by the number of rows, i.e. the number of samples:

#dummy data
df <- data.frame(
  snp1=c(1,1,1,2,2),
  snp1a=c(1,0,1,2,2),
  snp2=c(1,1,1,2,2),
  snp2a=c(1,1,1,2,2),
  snp3=c(1,2,1,2,2),
  snp3a=c(1,2,2,2,2))

df
#   snp1 snp1a snp2 snp2a snp3 snp3a
# 1    1     1    1     1    1     1
# 2    1     0    1     1    2     2
# 3    1     1    1     1    1     2
# 4    2     2    2     2    2     2
# 5    2     2    2     2    2     2

#concordance
colSums(df[,seq(1,ncol(df),2)]==df[,seq(2,ncol(df),2)])/nrow(df)
# snp1 snp2 snp3 
# 0.8  1.0  0.8 
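
The same odd/even indexing can also give the per-row match/mismatch columns asked for in the question. This is only a minimal sketch: the helper name add_concordance and the "_concordance" suffix are made up for illustration (the suffix avoids duplicate column names), and it assumes an even number of columns holding character or numeric values rather than factors.

```r
# Hypothetical helper: add a match/mismatch column after each successive column pair
add_concordance <- function(df) {
  stopifnot(ncol(df) %% 2 == 0)          # assumes columns come in pairs
  odd  <- seq(1, ncol(df), 2)
  even <- seq(2, ncol(df), 2)
  # Element-wise comparison of every odd column with its even neighbour
  same <- df[, odd, drop = FALSE] == df[, even, drop = FALSE]
  conc <- as.data.frame(ifelse(same, "match", "mismatch"),
                        stringsAsFactors = FALSE)
  names(conc) <- paste0(names(df)[odd], "_concordance")
  # Interleave: pair 1, its concordance, pair 2, its concordance, ...
  out <- cbind(df, conc)
  out[, as.vector(rbind(odd, even, ncol(df) + seq_along(odd)))]
}

# Toy data from the question; check.names = FALSE keeps names such as "1GT"
df1 <- data.frame(`1GT`  = c("0/0", "0/1"),
                  `1XGT` = c("0/1", "0/1"),
                  `2GT`  = c("zero", "one"),
                  `2XGT` = c("zero", "zero"),
                  check.names = FALSE, stringsAsFactors = FALSE)
res <- add_concordance(df1)
res
```

Because the comparison is done on whole blocks of columns at once, no explicit loop over the pairs is needed, so this should scale to thousands of columns.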

 


Thank you very much.

Reply written 5.5 years ago by MAPK

What are these values that you get from your example?

# snp1 snp2 snp3 
# 0.8  1.0  0.8 
Reply written 5.5 years ago by MAPK

It is the overall concordance across all samples: 0.8 for snp1 means that 80% of the samples have the same call for snp1 and snp1a.
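
As a worked check of that 0.8, here is the same calculation for the snp1/snp1a pair alone, using the values from the dummy data; the variable name agree is just for illustration.

```r
snp1  <- c(1, 1, 1, 2, 2)
snp1a <- c(1, 0, 1, 2, 2)

# One TRUE/FALSE per sample: does the call agree between the two columns?
agree <- snp1 == snp1a

# 4 of the 5 samples agree (only the second differs), so 4 / 5 = 0.8
concordance <- sum(agree) / length(agree)
concordance
```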

Reply written 5.4 years ago by zx8754

Thanks!

 

Reply written 5.4 years ago by MAPK