Question

R: Merge Duplicates In Column In Data.Frame By Condition In Another Column

1

Entering edit mode

10.3 years ago

ninninahm ▴ 70

Hi all!

I have the following problem.

  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030

In the 2. and 3. column are positions which I want to combine whenever the value in column 4 is the same. For example, I would want:

  chr13  1260 1276   chr13_38134720_38136919
  chr15   837  857   chr15_63332831_63335030

Any help is highly appreciated!!!

merge r • 20k views

ADD COMMENT • link updated 10.2 years ago by David W 4.9k • written 10.3 years ago by ninninahm ▴ 70

0

Entering edit mode

It sounds like by "combine", you mean the minimum from column 2 and the maximum from column 3.

ADD REPLY • link 10.3 years ago by Neilfws 49k

score 4 · Answer 1 · 2014-01-21

This is the sort of problem plyr can make life easier. It does the same "split-apply-combine" approach as the dpryan79's answer, but can wrap it all up in a (I think) nice and consistent syntax

In this case you could do something like

library(plyr)
d = read.table(text="
  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030
")

head(d)
#     V1   V2   V3                      V4
#1 chr13 1260 1275 chr13_38134720_38136919
#2 chr13 1261 1276 chr13_38134720_38136919
#3 chr15  839  854 chr15_63332831_63335030

ddply(d, .(V4), function(x) c( min(x[,2]), max(x[,3]) ) )
#                       V4   V1   V2
#1 chr13_38134720_38136919 1260 1276
#2 chr15_63332831_63335030  837  857

score 3 · Answer 2 · 2014-01-20

3

Entering edit mode

10.3 years ago

Devon Ryan 104k

Given a dataframe d:

> d
     V1   V2   V3                      V4
1 chr13 1260 1275 chr13_38134720_38136919
2 chr13 1261 1276 chr13_38134720_38136919
3 chr15  839  854 chr15_63332831_63335030
4 chr15  840  856 chr15_63332831_63335030
5 chr15  837  852 chr15_63332831_63335030
6 chr15  842  857 chr15_63332831_63335030

Split it by V4, apply a function, rotate back to the right shape and then reconvert to a dataframe:

dl <- split(d, d$V4)
d2 <- as.data.frame(t(sapply(dl, function(x) {c(as.character(unique(x$V1)), min(x$V2), max(x$V3), as.character(unique(x$V4)))})))
d2$V2 <- as.numeric(as.character(d2$V2)) #Convert from factors to numbers
d2$V3 <- as.numeric(as.character(d2$V3))

There are likely shorter solutions for the sapply step, but that's the general idea.

Edit: You could also make it a GRanges object and then split() and reduce() that, though if you have regions that aren't remotely near each other that may not work perfectly.

ADD COMMENT • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

this throws exception on me, Error in x$V1 : $ operator is invalid for atomic vectors, isn't x there a vector?

ADD REPLY • link 10.3 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

I guess I left out the dl <- split(d, d$V4) command.

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

yes, this is a nice solution. I had no idea of such a cool way to split a data frame. thank you!

ADD REPLY • link 10.3 years ago by Pavel Senin ★ 1.9k