Question: Plot average expression profile from one bed file across overlapping regions from another bed file in R
0
gravatar for newscient
9 weeks ago by
newscient10
European Union
newscient10 wrote:

Hello,

I have 2 bed files: one with standard length regions of 400 bp (a.bed) and one with Transcription start sites (TSS) of 1 bp length (b.bed), shown below.

I want to overlay/overlap the two files taking into consideration strandness (same strands) and plot the average expression in tpm (column 5 of b.bed) as y-axis on a single bp resolution of the 400 length regions of a.bed (as x-axis). Anyway I can do that using bedtools and/or R(GRanges/ggplot)? Is some kind of window-based approach mandatory in this case?

So the preferred resulting plot could be something like " Average expression profile vs distance from the midpoint of the 400bp region"

Thanks in advance.

a.bed

chr1 564878 565278 . . - 
chr1 567998 568398 . . - 
chr1 570359 570759 . . - 
chr1 570535 570935 . . +

b.bed

chr1    565702  565703  chr1:565702-565703,-    0.0206969   - 
chr1    566022  566023  chr1:566022-566023,-    0.0206969   -
chr1    566754  566755  chr1:566746-566767,+    0.320802    +
chr1    566783  566784  chr1:566771-566790,-    0.0724392   -
chr1    566893  566894  chr1:566891-566908,+    0.155227    +
chr1    566925  566926  chr1:566924-566930,-    0.124181    -
chr1    566935  566936  chr1:566935-566936,+    0.651953    +
chr1    567067  567068  chr1:567067-567068,+    0.1655753   +
chr1    567583  567584  chr1:567583-567584,-    0.0206969   -

--UPDATE/clarification--

The desired plot would look something like figures 5 and 6 on this link :http://cistrome.org/cr/help.php. An average expression profile from b.bed over the 400 bp regions from a.bed.

overlap granges R bed bedtools • 316 views
ADD COMMENTlink modified 9 weeks ago by Nicolas Rosewick7.0k • written 9 weeks ago by newscient10
1

Are the region in a.bed never overlapping to each other ?

ADD REPLYlink written 9 weeks ago by Nicolas Rosewick7.0k

There are actually overlapping entries inside a.bed. (And there are several overlaps with b.bed, i just show the construct in the example above (answering to your comment in your response below)). But i can subset based on some annotation and have non-overlapping subsets. So you can consider that they don't overlap if that's easier

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by newscient10

Ok then let say that a an entry in b is found in two region of a (thus overlapping). For example region of a : a1: 1-400 and a2: 50-450 ; position of b : 100 . So b will be at a relative position of 100 for a1 and 50 for a2 . So should the single position of b (100) be counted twice ?

ADD REPLYlink written 9 weeks ago by Nicolas Rosewick7.0k

No, that's what i am referring to as a summarised average. In order for it to be meaningful these cases should be added up and divided by the number of counts observed (or just get the expression value from b.bed once) and then averaged across all the 400bps of the a.bed. But in any case I am interested in subsets of a.bed entries that won't be overlapping to eachother so no need to bother with that.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by newscient10

Thanks, but I was thinking of genome wide and not per chromosome resolution. As I said Average expression (y-axis) vs 400 bp region of the a.bed regions. Like a type of ChIP-seq signal over genomic region plot (e.g. some bp around a TSS). Any solutions?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by newscient10

That's what I proposed. For each region in a.bed perform the average TPM in b.bed.

ADD REPLYlink written 9 weeks ago by Nicolas Rosewick7.0k

Maybe I haven't explained it correctly I want to get summarised average expression/expression profile across the region of 400bp. In other words, I want to know where I get expression from b.bed across these 400 bp in a.bed and not on a chromosome basis. To visualise it: Check figure5 and 6 in this link : http://cistrome.org/cr/help.php

ADD REPLYlink written 9 weeks ago by newscient10

Ok that's different ;)

ADD REPLYlink written 9 weeks ago by Nicolas Rosewick7.0k
1
gravatar for Nicolas Rosewick
9 weeks ago by
Belgium, Brussels
Nicolas Rosewick7.0k wrote:

You can try mapToTranscript from GenomicFeatures :

It should work (not tried though). If you can provide us a small dataset with dput() (in R) I can try it ;)

library(GenomicRanges)
library(tidyverse)
library(GenomicFeatures)

a.gr <- GRanges(a[,1],IRanges(a[,2],a[,3],names=1:nrow(a)),strand = a[,6])
b.gr <- GRanges(b[,1],IRanges(b[,2],b[,3]),strand = b[,6])

# extract relative position of each position in b within a regions
res <- as.tibble( mapToTranscripts( b.gr,a.gr,ignore.strand=F))

# add respective TPM for each entry
res$tpm <- b[res$xHits,5]

# group_by position and perform average TPM
res_summary <- res %>% group_by(start) %>% summarise(tpm=mean(tpm))

# plot
ggplot(res_summary,aes(x=start,y=tpm))+geom_point()+geom_line()
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Nicolas Rosewick7.0k

As a newbie with tidyverse, i need to ask. Does this average TPM calculation take into consideration the length of b, where the expression measurements are taken from? I am just thinking that this way we don't actually get an average expression. imagine the case of getting one hit in one region in a and 10 hits in another region.

ADD REPLYlink written 7 weeks ago by newscient10
1
gravatar for Nicolas Rosewick
9 weeks ago by
Belgium, Brussels
Nicolas Rosewick7.0k wrote:

edit : not really what the op asks. See my other answer.

There is no overlap between a and b in your example..

However this should work (not tested though):

library(GenomicRanges)

a.gr <- GRanges(a[,1],IRanges(a[,2],a[,3]),strand = a[,6])
b.gr <- GRanges(b[,1],IRanges(b[,2],b[,3]),strand = b[,6])

# perform overlap 
ov <- findOverlaps( a.gr,b.gr,ignore.strand=F)

# perform the average tpm per region in a.bed
res <- data.frame(tpm = sapply(split(subjectHits(ov),queryHits(ov)),function(x){mean(b[x,5])}))

# merge with a.bed information
res$chr <- a[row.names(res),1]
res$start <- a[row.names(res),2]
res$end <- a[row.names(res),3]

# plot : for the plotting I guess there are more efficient ways than geom_segment...
ggplot(res,aes(x=start,xend=end,y=tpm,yend=tpm))+geom_segment()+facet_grid(~chr)
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Nicolas Rosewick7.0k
1

It's strange there is a bug in the code formatting in biostars. I want to show ov <- findOverlaps( a.gr,b.gr,ignore.strand=F) but it shows only ov <- findOverlapsa.gr,b.gr,ignore.strand=F) . The opening bracket is missing ....

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Nicolas Rosewick7.0k

Happend to me a couple of times !

ADD REPLYlink written 9 weeks ago by Bastien Hervé2.7k

Nicolas Rosewick : That is a known bug with python (and I guess R code). You can put your code in a gist and then link it here to be safe.

ADD REPLYlink written 9 weeks ago by genomax59k
0
gravatar for Alex Reynolds
9 weeks ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Forward-strand:

$ awk '($6=="+")' a.bed > a.for.bed
$ awk '($6=="+")' b.bed > b.for.bed
$ bedmap --echo --mean --delim '\t' a.for.bed b.for.bed > answer.for.bed

Reverse-strand:

$ awk '($6=="-")' a.bed > a.rev.bed
$ awk '($6=="-")' b.bed > b.rev.bed
$ bedmap --echo --mean --delim '\t' a.rev.bed b.rev.bed > answer.rev.bed

Union:

$ bedops --everything answer.for.bed answer.rev.bed > answer.bed
ADD COMMENTlink written 9 weeks ago by Alex Reynolds26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1632 users visited in the last hour