Question: R programming: group and then compare columns to column and get the mismatch
2
3
Entering edit mode
8.2 years ago
d04103552 ▴ 30

Hi Guys,

I have a data-frame "mydata" with 200 columns, and over 15.000 Rows

  1. I would like to compare the two columns starting from the first column (XR_res1) and check if the contents match. For example, I want to compare column XR_res3 with column XR_res5 and get the concordance column in result with match or mismatch decision
  2. Group by "Personal_ID" all 15.000 rows and compare mismatches with column result of match or mismatch decision in percents

Thanks

Personal_ID     XR_res1     XR_res2     XR_res3     XR_res4     XR_res5     XR_res6
001             pos         pass        pos         neg         pos         neg
001             pos         pass        neg         pass        pass        pass
001             neg         neg         neg         pos         neg         pass
002             pass        pos         pos         pass        pass        pos
002             pos         pass        pass        neg         pass        pos
003             pass        neg         neg         pos         pass        pos
003             pos         neg         pass        pass        pos         pos
003             pass        pos         pos         pass        pass        neg
003             neg         pos         neg         pass        pos         neg
R • 3.1k views
ADD COMMENT
1
Entering edit mode
8.2 years ago

It isn't entirely clear what you are trying to calculate, but take a look at using the "gower" distance measure. Here is an example

> set.seed(123)
> library(cluster)
> x = data.frame(matrix(sample(c('neg','pos','pass'),100,replace=TRUE),nc=10))
> x
     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10
1   neg pass pass pass  neg  neg  pos pass  neg  neg
2  pass  pos pass pass  pos  pos  neg  pos pass  pos
3   pos pass  pos pass  pos pass  pos pass  pos  pos
4  pass  pos pass pass  pos  neg  neg  neg pass  pos
5  pass  neg  pos  neg  neg  pos pass  pos  neg  neg
6   neg pass pass  pos  neg  neg  pos  neg  pos  neg
7   pos  neg  pos pass  neg  neg pass  pos pass pass
8  pass  neg  pos  neg  pos pass pass  pos pass  neg
9   pos  neg  neg  neg  neg pass pass  pos pass  pos
10  pos pass  neg  neg pass  pos  pos  neg  neg  pos
> 1-daisy(x,'gower') # so that we are measuring similarity, not distance (easier to count)
Dissimilarities :
     1   2   3   4   5   6   7   8   9
2  0.2                                
3  0.4 0.3                            
4  0.3 0.8 0.3                        
5  0.3 0.3 0.1 0.1                    
6  0.7 0.1 0.3 0.3 0.2                
7  0.3 0.3 0.3 0.3 0.5 0.2            
8  0.1 0.4 0.3 0.3 0.7 0.1 0.5        
9  0.1 0.3 0.3 0.2 0.5 0.1 0.6 0.6    
10 0.3 0.2 0.4 0.2 0.3 0.3 0.1 0.1 0.4

Metric :  mixed ;  Types = N, N, N, N, N, N, N, N, N, N 
Number of objects : 10

To get column similarities (or distances), simply transpose your matrix (assuming that everything uses the same three potential values).

ADD COMMENT
0
Entering edit mode

Hi Sean, Thanks for your answer, This solution isn't really working for me,

Sorry I wasn't clear enough in describing the problem

  1. As a first step I simply need to compare 2 columns (with non numeric values) to each other for similarity

    For instance comparing XR_res3 with column XR_res5 to check and get a result of mismatch in percents, here I probably need to use some like all.equal but can't figure out the syntax.

  2. As a step two I need to group data by Personal_ID's and calculate mismatch per group

    so should compare XR_res3 with column XR_res5 for group with Personal_ID 001, then XR_res3 with column XR_res5 for group with Personal_ID 002 and so on,

For that purpose I will build a new data some like:

dt.1<- subset(mydata, select=c("Personal_ID")) dt.2<- subset(mydata, select=c("XR_res3")) dt.3<- subset(mydata, select=c("XR_res5")) dt.x<- cbind(dt.1,dt.2,dt.3)

and my final result should look some like

Personal_ID    XR_res3    XR_res5    Mismatch Result
001            pos        neg        0.50
001            pos        pos    
001            neg        pos
002            neg        pass       1.0
002            neg        pass
003            pos        pos        0.0
003            pass       pass
003            pass       pass
003            neg        neg

I suppose I should use data.tables [,, by= Personal_ID] but again can't figure out the syntax

Thanks

ADD REPLY
0
Entering edit mode
8.1 years ago

For your first problem you can do something like this (I took Sean's code to generate the dummy data):

# Set random seed for reproducibility
set.seed(1)

# Generate dummy data
mydata = data.frame(matrix(sample(c('neg','pos','pass'), 100, replace = TRUE), nc = 10))

# Similarity matrix
sim = apply(mydata, 2, function(x) colSums(x == mydata)/nrow(mydata))

> mydata
     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10
1   neg  neg pass  pos pass  pos pass  pos  pos  neg
2   pos  neg  neg  pos  pos pass  neg pass pass  neg
3   pos pass  pos  pos pass  pos  pos  pos  pos  pos
4  pass  pos  neg  neg  pos  neg  neg  pos  neg pass
5   neg pass  neg pass  pos  neg  pos  pos pass pass
6  pass  pos  pos pass pass  neg  neg pass  neg pass
7  pass pass  neg pass  neg  neg  pos pass pass  pos
8   pos pass  pos  neg  pos  pos pass  pos  neg  pos
9   pos  pos pass pass pass  pos  neg pass  neg pass
10  neg pass  pos  pos pass  pos pass pass  neg  pos


     X1  X2  X3  X4  X5  X6  X7  X8  X9 X10
X1  1.0 0.3 0.3 0.4 0.3 0.4 0.1 0.4 0.3 0.5
X2  0.3 1.0 0.2 0.2 0.3 0.1 0.3 0.3 0.2 0.3
X3  0.3 0.2 1.0 0.4 0.4 0.6 0.4 0.3 0.2 0.5
X4  0.4 0.2 0.4 1.0 0.3 0.4 0.2 0.5 0.6 0.5
X5  0.3 0.3 0.4 0.3 1.0 0.2 0.3 0.6 0.0 0.3
X6  0.4 0.1 0.6 0.4 0.2 1.0 0.3 0.4 0.5 0.3
X7  0.1 0.3 0.4 0.2 0.3 0.3 1.0 0.3 0.4 0.3
X8  0.4 0.3 0.3 0.5 0.6 0.4 0.3 1.0 0.4 0.4
X9  0.3 0.2 0.2 0.6 0.0 0.5 0.4 0.4 1.0 0.2
X10 0.5 0.3 0.5 0.5 0.3 0.3 0.3 0.4 0.2 1.0

Each entry on this matrix corresponds to the % similarity between each couple of columns. The diagonal is 1 because when you compare a column with itself it has 100% identity, and the matrix is symmetrical because if you compare X1 with X2 it is the same of comparing X2 and X1.

For your second problem, you can just create a new set of dataframes splitting them by Personal_ID with this function:

split(mydata, mydata$Personal_ID)

and then do the same as above.

Surely there are better (less convoluted) solutions but this is the one with the lower amount of lines of code I could produce :D

ADD COMMENT

Login before adding your answer.

Traffic: 1502 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6