Question: R: Merge Duplicates In Column In Data.Frame By Condition In Another Column
1
gravatar for ninninahm
3.5 years ago by
ninninahm50
ninninahm50 wrote:

Hi all!

I have the following problem.

  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030

In the 2. and 3. column are positions which I want to combine whenever the value in column 4 is the same. For example, I would want:

  chr13  1260 1276   chr13_38134720_38136919
  chr15   837  857   chr15_63332831_63335030

Any help is highly appreciated!!!

R merge • 5.2k views
ADD COMMENTlink modified 3.5 years ago by David W4.6k • written 3.5 years ago by ninninahm50

It sounds like by "combine", you mean the minimum from column 2 and the maximum from column 3.

ADD REPLYlink written 3.5 years ago by Neilfws47k
3
gravatar for Devon Ryan
3.5 years ago by
Devon Ryan68k
Freiburg, Germany
Devon Ryan68k wrote:

Given a dataframe d:

> d
     V1   V2   V3                      V4
1 chr13 1260 1275 chr13_38134720_38136919
2 chr13 1261 1276 chr13_38134720_38136919
3 chr15  839  854 chr15_63332831_63335030
4 chr15  840  856 chr15_63332831_63335030
5 chr15  837  852 chr15_63332831_63335030
6 chr15  842  857 chr15_63332831_63335030

Split it by V4, apply a function, rotate back to the right shape and then reconvert to a dataframe:

dl <- split(d, d$V4)
d2 <- as.data.frame(t(sapply(dl, function(x) {c(as.character(unique(x$V1)), min(x$V2), max(x$V3), as.character(unique(x$V4)))})))
d2$V2 <- as.numeric(as.character(d2$V2)) #Convert from factors to numbers
d2$V3 <- as.numeric(as.character(d2$V3))

There are likely shorter solutions for the sapply step, but that's the general idea.

Edit: You could also make it a GRanges object and then split() and reduce() that, though if you have regions that aren't remotely near each other that may not work perfectly.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Devon Ryan68k

this throws exception on me, Error in x$V1 : $ operator is invalid for atomic vectors, isn't x there a vector?

ADD REPLYlink written 3.5 years ago by Pavel Senin1.8k

I guess I left out the dl <- split(d, d$V4) command.

ADD REPLYlink written 3.5 years ago by Devon Ryan68k

yes, this is a nice solution. I had no idea of such a cool way to split a data frame. thank you!

ADD REPLYlink written 3.5 years ago by Pavel Senin1.8k
3
gravatar for David W
3.5 years ago by
David W4.6k
New Zealand
David W4.6k wrote:

This is the sort of problem plyr can make life easier. It does the same "split-apply-combine" approach as the dpryan79's answer, but can wrap it all up in a (I think) nice and consistent syntax

In this case you could do something like

library(plyr)
d = read.table(text="
  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030
")

head(d)
#     V1   V2   V3                      V4
#1 chr13 1260 1275 chr13_38134720_38136919
#2 chr13 1261 1276 chr13_38134720_38136919
#3 chr15  839  854 chr15_63332831_63335030

ddply(d, .(V4), function(x) c( min(x[,2]), max(x[,3]) ) )
#                       V4   V1   V2
#1 chr13_38134720_38136919 1260 1276
#2 chr15_63332831_63335030  837  857
ADD COMMENTlink written 3.5 years ago by David W4.6k

In fact, this is exactly what I had in mind, only I couldn't remember that it was the plyr package :p This is a great solution.

ADD REPLYlink written 3.5 years ago by Devon Ryan68k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 647 users visited in the last hour