Question: Question: R programming: group and then compare columns to column and get the mismatch
3
gravatar for d04103552
3.8 years ago by
d0410355230
d0410355230 wrote:

Hi Guys,

I have a data-frame "mydata" with 200 columns, and over 15.000 Rows

1. I would like to compare the two columns starting from the first column (XR_res1) and check if the contents match. For example, I want to compare column XR_res3 with column 1XR_res5 and get the concordance column in result with match or mismatch decision ,

2. Group by "Personal_ID" all 15.000 rows and compare mismatches with column result of match or mismatch decision in percents

Thanks

Personal_ID XR_res1 XR_res2 XR_res3 XR_res4 XR_res5 XR_res6
001 pos pass pos neg pos neg
001 pos pass neg pass pass pass
001 neg neg neg pos neg pass
002 pass pos pos pass pass pos
002 pos pass pass neg pass pos
003 pass neg neg pos pass pos
003 pos neg pass pass pos pos
003 pass pos pos pass pass neg
003 neg pos neg pass pos neg

 

programming R • 1.5k views
ADD COMMENTlink modified 3.7 years ago by Selenocysteine560 • written 3.8 years ago by d0410355230
1
gravatar for Sean Davis
3.8 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

It isn't entirely clear what you are trying to calculate, but take a look at using the "gower" distance measure.  Here is an example

> set.seed(123)
> library(cluster)
> x = data.frame(matrix(sample(c('neg','pos','pass'),100,replace=TRUE),nc=10))
> x
     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10
1   neg pass pass pass  neg  neg  pos pass  neg  neg
2  pass  pos pass pass  pos  pos  neg  pos pass  pos
3   pos pass  pos pass  pos pass  pos pass  pos  pos
4  pass  pos pass pass  pos  neg  neg  neg pass  pos
5  pass  neg  pos  neg  neg  pos pass  pos  neg  neg
6   neg pass pass  pos  neg  neg  pos  neg  pos  neg
7   pos  neg  pos pass  neg  neg pass  pos pass pass
8  pass  neg  pos  neg  pos pass pass  pos pass  neg
9   pos  neg  neg  neg  neg pass pass  pos pass  pos
10  pos pass  neg  neg pass  pos  pos  neg  neg  pos
> 1-daisy(x,'gower') # so that we are measuring similarity, not distance (easier to count)
Dissimilarities :
     1   2   3   4   5   6   7   8   9
2  0.2                                
3  0.4 0.3                            
4  0.3 0.8 0.3                        
5  0.3 0.3 0.1 0.1                    
6  0.7 0.1 0.3 0.3 0.2                
7  0.3 0.3 0.3 0.3 0.5 0.2            
8  0.1 0.4 0.3 0.3 0.7 0.1 0.5        
9  0.1 0.3 0.3 0.2 0.5 0.1 0.6 0.6    
10 0.3 0.2 0.4 0.2 0.3 0.3 0.1 0.1 0.4

Metric :  mixed ;  Types = N, N, N, N, N, N, N, N, N, N 
Number of objects : 10

To get column similarities (or distances), simply transpose your matrix (assuming that everything uses the same three potential values).

 

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Sean Davis25k

Hi Sean, Thanks for your answer, This solution isn't really working for me, 

Sorry I wasn't clear enough in describing the problem

1. as a first stem I simply need to compare 2 locumns (with non numeric values) to each other for similarity

for instance comparing XR_res3 with column XR_res5 to check and get a result of mismatch in percents, here i probably need to use some like all.equal but can't figure out the sintax.

​2. as a step two I need to group data by Personal_ID's  and calculate mismatch per grupes

so should compare XR_res3 with column XR_res5  for group with Personal_ID 001, then XR_res3 with column XR_res5  for group with Personal_ID 002 and so on,

For that purpose I will build a new data some like : 

dt.1<- subset(mydata, select=c("Personal_ID"))

dt.2<- subset(mydata, select=c("XR_res3"))

dt.3<- subset(mydata, select=c("XR_res5"))

dt.x<- cbind(dt.1,dt.2,dt.3)

and my final result should look some like

Personal_ID XR_res3 XR_res5 Mismatch Result
001 pos neg 0.50
001 pos pos  
001 neg pos  
002 neg pass 1.0
002 neg pass  
003 pos pos 0.0
003 pass pass  
003 pass pass  
003 neg neg  

 

I suppose I should use data.table[,, by= Personal_ID]  but again can't figure out the syntax 

Thanks

ADD REPLYlink written 3.8 years ago by d0410355230
0
gravatar for Selenocysteine
3.7 years ago by
Dublin, Ireland
Selenocysteine560 wrote:

For your first problem you can do something like this (I took Sean's code to generate the dummy data):

# Set random seed for reproducibility
set.seed(1)

# Generate dummy data
mydata = data.frame(matrix(sample(c('neg','pos','pass'), 100, replace = TRUE), nc = 10))

# Similarity matrix
sim = apply(mydata, 2, function(x) colSums(x == mydata)/nrow(mydata))

> mydata
     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10
1   neg  neg pass  pos pass  pos pass  pos  pos  neg
2   pos  neg  neg  pos  pos pass  neg pass pass  neg
3   pos pass  pos  pos pass  pos  pos  pos  pos  pos
4  pass  pos  neg  neg  pos  neg  neg  pos  neg pass
5   neg pass  neg pass  pos  neg  pos  pos pass pass
6  pass  pos  pos pass pass  neg  neg pass  neg pass
7  pass pass  neg pass  neg  neg  pos pass pass  pos
8   pos pass  pos  neg  pos  pos pass  pos  neg  pos
9   pos  pos pass pass pass  pos  neg pass  neg pass
10  neg pass  pos  pos pass  pos pass pass  neg  pos


     X1  X2  X3  X4  X5  X6  X7  X8  X9 X10
X1  1.0 0.3 0.3 0.4 0.3 0.4 0.1 0.4 0.3 0.5
X2  0.3 1.0 0.2 0.2 0.3 0.1 0.3 0.3 0.2 0.3
X3  0.3 0.2 1.0 0.4 0.4 0.6 0.4 0.3 0.2 0.5
X4  0.4 0.2 0.4 1.0 0.3 0.4 0.2 0.5 0.6 0.5
X5  0.3 0.3 0.4 0.3 1.0 0.2 0.3 0.6 0.0 0.3
X6  0.4 0.1 0.6 0.4 0.2 1.0 0.3 0.4 0.5 0.3
X7  0.1 0.3 0.4 0.2 0.3 0.3 1.0 0.3 0.4 0.3
X8  0.4 0.3 0.3 0.5 0.6 0.4 0.3 1.0 0.4 0.4
X9  0.3 0.2 0.2 0.6 0.0 0.5 0.4 0.4 1.0 0.2
X10 0.5 0.3 0.5 0.5 0.3 0.3 0.3 0.4 0.2 1.0

Each entry on this matrix corresponds to the % similarity between each couple of columns. The diagonal is 1 because when you compare a column with itself it has 100% identity, and the matrix is symmetrical because if you compare X1 with X2 it is the same of comparing X2 and X1.

For your second problem, you can just create a new set of dataframes splitting them by Personal_ID with this function:

split(mydata, mydata$Personal_ID)

and then do the same as above.

Surely there are better (less convoluted) solutions but this is the one with the lower amount of lines of code I could produce :D

ADD COMMENTlink written 3.7 years ago by Selenocysteine560
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1983 users visited in the last hour