Question: R programming: compare columns to column and get the mismatch
5.5 years ago, MAPK (1.6k) wrote:

Hi Guys,

I have a data frame (df1) with more than 1000 columns. I would like to compare each pair of successive columns, starting from the first column (1GT), and check whether their contents match. For example, I want to compare column 1GT with column 1XGT and get a concordance column in the result with a match or mismatch decision. Thank you.

df1:

 1GT  1XGT  2GT   2XGT
 0/0  0/1   zero  zero
 0/1  0/1   one   zero

Result:

 1GT  1XGT  concordance  2GT   2XGT  concordance
 0/0  0/1   mismatch     zero  zero  match
 0/1  0/1   match        one   zero  mismatch

5.5 years ago, lkmklsmn (930), United States, wrote:
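R code:

    match(df[["1GT"]], df[["1XGT"]])

This gives, for each entry of 1GT, the position of its first match in 1XGT. (The column names start with a digit, so they need df[["1GT"]] or backticks; a bare df$1GT is a syntax error.) Note that this is a lookup, not a row-by-row comparison. If you want a vector with "concordant"/"discordant" per row, you can generate a vector of all "discordant" and then overwrite the rows where the two columns are equal:

    tmp <- rep("discordant", nrow(df))
    tmp[which(df[["1GT"]] == df[["1XGT"]])] <- "concordant"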
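To see how match() differs from an element-wise comparison, here is a minimal sketch on the two rows from the question. The data frame is rebuilt here purely for illustration, and check.names = FALSE is used so that the column names starting with a digit are kept as they are.

```r
# Toy data from the question; check.names = FALSE keeps names such as "1GT"
df <- data.frame("1GT"  = c("0/0", "0/1"),
                 "1XGT" = c("0/1", "0/1"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# match() is a lookup: position of the first match of each 1GT entry in 1XGT
pos <- match(df[["1GT"]], df[["1XGT"]])
pos
# NA  1

# Element-wise comparison: one decision per row
tmp <- ifelse(df[["1GT"]] == df[["1XGT"]], "concordant", "discordant")
tmp
# "discordant" "concordant"
```

Here match() reports position 1 for the second row only because "0/1" first occurs at position 1 of 1XGT, which says nothing about whether row 2 agrees with itself. The == comparison is what answers the row-by-row question.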

Thank you, what if I have thousands of columns and want to match two columns at a time across the table?

5.5 years ago, zx8754 (9.7k), London, wrote:

Something like the code below. The idea is to compare the odd columns with the even columns (that is what seq is doing), then use colSums to count the TRUE values for each pair, then divide by the number of rows, i.e. the number of samples:

```
# dummy data: three SNPs, each followed by its paired column (snpN, snpNa)
df <- data.frame(
  snp1  = c(1, 1, 1, 2, 2),
  snp1a = c(1, 0, 1, 2, 2),
  snp2  = c(1, 1, 1, 2, 2),
  snp2a = c(1, 1, 1, 2, 2),
  snp3  = c(1, 2, 1, 2, 2),
  snp3a = c(1, 2, 2, 2, 2))

df
#   snp1 snp1a snp2 snp2a snp3 snp3a
# 1    1     1    1     1    1     1
# 2    1     0    1     1    2     2
# 3    1     1    1     1    1     2
# 4    2     2    2     2    2     2
# 5    2     2    2     2    2     2

# concordance: proportion of rows where each odd column equals the even column next to it
colSums(df[, seq(1, ncol(df), 2)] == df[, seq(2, ncol(df), 2)]) / nrow(df)
# snp1 snp2 snp3
#  0.8  1.0  0.8
```
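The same odd/even pairing can also produce the per-row match/mismatch columns asked for in the question, rather than a per-pair proportion. The following is a rough sketch, not part of the original answer: it rebuilds the question's df1 as a toy example and assumes the columns always come in adjacent pairs (an even number of columns).

```r
# Toy data from the question; check.names = FALSE keeps names such as "1GT"
df1 <- data.frame("1GT"  = c("0/0", "0/1"),
                  "1XGT" = c("0/1", "0/1"),
                  "2GT"  = c("zero", "one"),
                  "2XGT" = c("zero", "zero"),
                  check.names = FALSE, stringsAsFactors = FALSE)

odd  <- seq(1, ncol(df1), by = 2)   # positions of 1GT, 2GT, ...
even <- seq(2, ncol(df1), by = 2)   # positions of 1XGT, 2XGT, ...

# For each pair, keep both columns and append a match/mismatch column
pairs <- Map(function(i, j) {
  data.frame(df1[, c(i, j)],
             concordance = ifelse(df1[[i]] == df1[[j]], "match", "mismatch"),
             check.names = FALSE, stringsAsFactors = FALSE)
}, odd, even)

# Bind the per-pair blocks side by side; repeated "concordance" names are kept
result <- do.call(cbind, pairs)
result
```

For the toy data this gives the layout from the question: 1GT, 1XGT, concordance, 2GT, 2XGT, concordance. Rows with a missing value in either column of a pair get NA in the concordance column, because == returns NA there.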

Thank you very much.

What does this output from your example mean?

```
# snp1 snp2 snp3
#  0.8  1.0  0.8
```

It is the overall concordance across all samples. 0.8 for snp1 means that 80% of the samples have the same call for snp1 and snp1a.
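The 0.8 can be checked by hand for the first pair. A small sketch using the snp1 and snp1a values from the dummy data above:

```r
snp1  <- c(1, 1, 1, 2, 2)
snp1a <- c(1, 0, 1, 2, 2)

same <- snp1 == snp1a   # TRUE FALSE TRUE TRUE TRUE
sum(same)               # 4 of the 5 samples agree
mean(same)              # 4 / 5 = 0.8
```

Only the second sample differs (1 versus 0), so 4 of 5 calls agree.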

Thanks!