Question

How to associate a row from one data frame with a value in another data frame based on three columns

0

Entering edit mode

6.4 years ago

baizm ▴ 10

I have two data frames, df1 one has a list of gene variants from a vcf file and df2 two has a list of predicted genes in a genome assembly. Each variant in df1 occurs within one of the predicted genes in df2. I want to associate each variant with the gene it occurs in.

Here is the first few lines of what they look like:

df1<-data.frame(contig=c(rep('contig_0', 6)),pos=c(899983,937283,951771,991102,1034215,1063818))
df2<-data.frame(pred_gene=c('g1','g2','g3','g4','g5','g6'),
    contig=c('contig_0','contig_0','contig_0','contig_0','contig_2','contig_2'),
    start=c(355079,446820,700794,887159,110971,156060),
    stop=c(355336,462604,707341,1236478,112320,284753))

What I want is to make a third column in df1 with the associated pred_gene from df2. In this case, each variant is in pred_gene g4. There are thousands of contigs represented in df2 and many predicted genes have more than one variant in them.

R • 2.0k views

ADD COMMENT • link updated 6.4 years ago by cpad0112 21k • written 6.4 years ago by baizm ▴ 10

2

Entering edit mode

seeing all these diverse answers rolling in it'd be great if you could benchmark them on your large data set!

ADD REPLY • link 6.4 years ago by Friederike 8.9k

1

Entering edit mode

Please format your code in the future. Makes it a lot easier to read.

ADD REPLY • link 6.4 years ago by Ram 43k

0

Entering edit mode

You say each variant in df1 occurs within one predicted gene, but the contig_0 and contig_2 have pred_gene g1-4 and g5-6, respectively.

ADD REPLY • link 6.4 years ago by Hussain Ather ▴ 990

1

Entering edit mode

Sorry, what? df1 only has contig_0. And you can see all of df1's entries match up to g4, as OP says. All df1$pos values fall between g4's start and stop

ADD REPLY • link 6.4 years ago by Ram 43k

score 2 · Answer 1 · 2017-11-08

2

Entering edit mode

6.4 years ago

Kevin Blighe 87k

Here is the solution. They do indeed all map to G4 in this particular example.

require("GenomicRanges")
df1.GR <- makeGRangesFromDataFrame(df1, seqnames.field="contig", start.field="pos", end.field="pos", keep.extra.columns=TRUE)
df2.GR <- makeGRangesFromDataFrame(df2, seqnames.field="contig", start.field="start", end.field="stop", keep.extra.columns=TRUE)

overlaps <- findOverlaps( df1.GR, df2.GR, type="within" )
data.frame(df1[queryHits(overlaps),], df2[subjectHits(overlaps),])

    contig     pos pred_gene contig.1  start    stop
1 contig_0  899983        g4 contig_0 887159 1236478
2 contig_0  937283        g4 contig_0 887159 1236478
3 contig_0  951771        g4 contig_0 887159 1236478
4 contig_0  991102        g4 contig_0 887159 1236478
5 contig_0 1034215        g4 contig_0 887159 1236478
6 contig_0 1063818        g4 contig_0 887159 1236478

ADD COMMENT • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Works like a charm, thank you!

ADD REPLY • link 6.4 years ago by baizm ▴ 10

0

Entering edit mode

Please try the other solutions, if you have time, and then upvote and/or accept. This way people coming here in the future will instantly know which solutions worked.

Thanks!

Kevin

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Is this faster than the sapply I've used below? It's definitely more bioinformatics-y, but what are the pros and cons of our approaches?

ADD REPLY • link 6.4 years ago by Ram 43k

2

Entering edit mode

Possibly just as quick. It's good to have different solutions. I answer a lot of these types of questions just to keep my programming skills updated

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

It's doing the parentheses thing again!

findOperlaps(df1.. won't be rendered as

findOverlaps(df1,...

ADD REPLY • link 6.4 years ago by Ram 43k

0

Entering edit mode

Good spot, Ram! I have edited it. Strange quirk in the formatting

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

score 1 · Answer 2 · 2017-11-08

You need apply or one of its variants. I'm using sapply here

df1$gene=sapply(X=df1$pos, FUN=function(x) { df2[df2$start<x & df2$stop>x,"pred_gene"]})

Also, one of my options is set as follows (globally):

options(stringsAsFactors=FALSE)

I'm mentioning that in case you encounter problems - you might need to recreate data frame with that option enabled.

score 1 · Answer 3 · 2017-11-09

1

Entering edit mode

6.4 years ago

cpad0112 21k

>library(sqldf)    
> sqldf("select * from df1 left join df2 on  df1.pos between df2.start and df2.stop")
        contig     pos pred_gene   contig  start    stop
    1 contig_0  899983        g4 contig_0 887159 1236478
    2 contig_0  937283        g4 contig_0 887159 1236478
    3 contig_0  951771        g4 contig_0 887159 1236478
    4 contig_0  991102        g4 contig_0 887159 1236478
    5 contig_0 1034215        g4 contig_0 887159 1236478
    6 contig_0 1063818        g4 contig_0 887159 1236478

ADD COMMENT • link 6.4 years ago by cpad0112 21k

1

Entering edit mode

ZOMG there's something that lets me do SQL with data frames?!?!?! R is insanity, I tell you!

ADD REPLY • link 6.4 years ago by Ram 43k

score 1 · Answer 4 · 2017-11-09

just to add to the diversity of answers, here's a solution using the data.table package, which tends to have blazingly fast operations.

library(data.table)
dt1 <- data.table(contig = c(rep('contig_0', 6)),
                             start = c(899983,937283,951771,991102,1034215,1063818),
                            stop = c(899983,937283,951771,991102,1034215,1063818))
dt2 <- data.table(pred_gene = c('g1','g2','g3','g4','g5','g6'),
                           contig = c('contig_0','contig_0','contig_0','contig_0','contig_2','contig_2'),
                           start = c(355079,446820,700794,887159,110971,156060),
                            stop = c(355336,462604,707341,1236478,112320,284753))
setkey(dt2, start, stop) # this assumes that dt2 has fewer, but larger ranges than dt1
foverlaps(dt1, dt2)

result:

   pred_gene   contig  start    stop i.contig i.start  i.stop
1:        g4 contig_0 887159 1236478 contig_0  899983  899983
2:        g4 contig_0 887159 1236478 contig_0  937283  937283
3:        g4 contig_0 887159 1236478 contig_0  951771  951771
4:        g4 contig_0 887159 1236478 contig_0  991102  991102
5:        g4 contig_0 887159 1236478 contig_0 1034215 1034215
6:        g4 contig_0 887159 1236478 contig_0 1063818 1063818

score 1 · Answer 5 · 2017-11-09

Another solution:

> library(fuzzyjoin)
> fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))

output:

> fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))
  contig.x     pos pred_gene contig.y  start    stop
1 contig_0  899983        g4 contig_0 887159 1236478
2 contig_0  937283        g4 contig_0 887159 1236478
3 contig_0  951771        g4 contig_0 887159 1236478
4 contig_0  991102        g4 contig_0 887159 1236478
5 contig_0 1034215        g4 contig_0 887159 1236478
6 contig_0 1063818        g4 contig_0 887159 1236478