R: Merge Duplicates In Column In Data.Frame By Condition In Another Column
2
1
Entering edit mode
7.9 years ago
ninninahm ▴ 70

Hi all!

I have the following problem.

  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030

In the 2. and 3. column are positions which I want to combine whenever the value in column 4 is the same. For example, I would want:

  chr13  1260 1276   chr13_38134720_38136919
  chr15   837  857   chr15_63332831_63335030

Any help is highly appreciated!!!

merge r • 16k views
ADD COMMENT
0
Entering edit mode

It sounds like by "combine", you mean the minimum from column 2 and the maximum from column 3.

ADD REPLY
4
Entering edit mode
7.9 years ago
David W 4.8k

This is the sort of problem plyr can make life easier. It does the same "split-apply-combine" approach as the dpryan79's answer, but can wrap it all up in a (I think) nice and consistent syntax

In this case you could do something like

library(plyr)
d = read.table(text="
  chr13  1260 1275   chr13_38134720_38136919
  chr13  1261 1276   chr13_38134720_38136919
  chr15   839  854   chr15_63332831_63335030
  chr15   840  856   chr15_63332831_63335030
  chr15   837  852   chr15_63332831_63335030
  chr15   842  857   chr15_63332831_63335030
")

head(d)
#     V1   V2   V3                      V4
#1 chr13 1260 1275 chr13_38134720_38136919
#2 chr13 1261 1276 chr13_38134720_38136919
#3 chr15  839  854 chr15_63332831_63335030

ddply(d, .(V4), function(x) c( min(x[,2]), max(x[,3]) ) )
#                       V4   V1   V2
#1 chr13_38134720_38136919 1260 1276
#2 chr15_63332831_63335030  837  857
ADD COMMENT
0
Entering edit mode

In fact, this is exactly what I had in mind, only I couldn't remember that it was the plyr package :p This is a great solution.

ADD REPLY
3
Entering edit mode
7.9 years ago

Given a dataframe d:

> d
     V1   V2   V3                      V4
1 chr13 1260 1275 chr13_38134720_38136919
2 chr13 1261 1276 chr13_38134720_38136919
3 chr15  839  854 chr15_63332831_63335030
4 chr15  840  856 chr15_63332831_63335030
5 chr15  837  852 chr15_63332831_63335030
6 chr15  842  857 chr15_63332831_63335030

Split it by V4, apply a function, rotate back to the right shape and then reconvert to a dataframe:

dl <- split(d, d$V4)
d2 <- as.data.frame(t(sapply(dl, function(x) {c(as.character(unique(x$V1)), min(x$V2), max(x$V3), as.character(unique(x$V4)))})))
d2$V2 <- as.numeric(as.character(d2$V2)) #Convert from factors to numbers
d2$V3 <- as.numeric(as.character(d2$V3))

There are likely shorter solutions for the sapply step, but that's the general idea.

Edit: You could also make it a GRanges object and then split() and reduce() that, though if you have regions that aren't remotely near each other that may not work perfectly.

ADD COMMENT
0
Entering edit mode

this throws exception on me, Error in x$V1 : $ operator is invalid for atomic vectors, isn't x there a vector?

ADD REPLY
0
Entering edit mode

I guess I left out the dl <- split(d, d$V4) command.

ADD REPLY
0
Entering edit mode

yes, this is a nice solution. I had no idea of such a cool way to split a data frame. thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1704 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6