Plot average expression profile from one bed file across overlapping regions from another bed file in R
3
0
Entering edit mode
3.2 years ago
newscient ▴ 20

Hello,

I have 2 bed files: one with standard length regions of 400 bp (a.bed) and one with Transcription start sites (TSS) of 1 bp length (b.bed), shown below.

I want to overlay/overlap the two files taking into consideration strandness (same strands) and plot the average expression in tpm (column 5 of b.bed) as y-axis on a single bp resolution of the 400 length regions of a.bed (as x-axis). Anyway I can do that using bedtools and/or R(GRanges/ggplot)? Is some kind of window-based approach mandatory in this case?

So the preferred resulting plot could be something like " Average expression profile vs distance from the midpoint of the 400bp region"

a.bed

chr1 564878 565278 . . -
chr1 567998 568398 . . -
chr1 570359 570759 . . -
chr1 570535 570935 . . +


b.bed

chr1    565702  565703  chr1:565702-565703,-    0.0206969   -
chr1    566022  566023  chr1:566022-566023,-    0.0206969   -
chr1    566754  566755  chr1:566746-566767,+    0.320802    +
chr1    566783  566784  chr1:566771-566790,-    0.0724392   -
chr1    566893  566894  chr1:566891-566908,+    0.155227    +
chr1    566925  566926  chr1:566924-566930,-    0.124181    -
chr1    566935  566936  chr1:566935-566936,+    0.651953    +
chr1    567067  567068  chr1:567067-567068,+    0.1655753   +
chr1    567583  567584  chr1:567583-567584,-    0.0206969   -


--UPDATE/clarification--

The desired plot would look something like figures 5 and 6 on this link :http://cistrome.org/cr/help.php. An average expression profile from b.bed over the 400 bp regions from a.bed.

bed R bedtools overlap GRanges • 1.6k views
1
Entering edit mode

Are the region in a.bed never overlapping to each other ?

0
Entering edit mode

There are actually overlapping entries inside a.bed. (And there are several overlaps with b.bed, i just show the construct in the example above (answering to your comment in your response below)). But i can subset based on some annotation and have non-overlapping subsets. So you can consider that they don't overlap if that's easier

0
Entering edit mode

Ok then let say that a an entry in b is found in two region of a (thus overlapping). For example region of a : a1: 1-400 and a2: 50-450 ; position of b : 100 . So b will be at a relative position of 100 for a1 and 50 for a2 . So should the single position of b (100) be counted twice ?

0
Entering edit mode

No, that's what i am referring to as a summarised average. In order for it to be meaningful these cases should be added up and divided by the number of counts observed (or just get the expression value from b.bed once) and then averaged across all the 400bps of the a.bed. But in any case I am interested in subsets of a.bed entries that won't be overlapping to eachother so no need to bother with that.

0
Entering edit mode

Thanks, but I was thinking of genome wide and not per chromosome resolution. As I said Average expression (y-axis) vs 400 bp region of the a.bed regions. Like a type of ChIP-seq signal over genomic region plot (e.g. some bp around a TSS). Any solutions?

0
Entering edit mode

That's what I proposed. For each region in a.bed perform the average TPM in b.bed.

0
Entering edit mode

Maybe I haven't explained it correctly I want to get summarised average expression/expression profile across the region of 400bp. In other words, I want to know where I get expression from b.bed across these 400 bp in a.bed and not on a chromosome basis. To visualise it: Check figure5 and 6 in this link : http://cistrome.org/cr/help.php

0
Entering edit mode

Ok that's different ;)

1
Entering edit mode
3.1 years ago

You can try mapToTranscript from GenomicFeatures :

It should work (not tried though). If you can provide us a small dataset with dput() (in R) I can try it ;)

library(GenomicRanges)
library(tidyverse)
library(GenomicFeatures)

a.gr <- GRanges(a[,1],IRanges(a[,2],a[,3],names=1:nrow(a)),strand = a[,6])
b.gr <- GRanges(b[,1],IRanges(b[,2],b[,3]),strand = b[,6])

# extract relative position of each position in b within a regions
res <- as.tibble( mapToTranscripts( b.gr,a.gr,ignore.strand=F))

# add respective TPM for each entry
res$tpm <- b[res$xHits,5]

# group_by position and perform average TPM
res_summary <- res %>% group_by(start) %>% summarise(tpm=mean(tpm))

# plot
ggplot(res_summary,aes(x=start,y=tpm))+geom_point()+geom_line()

0
Entering edit mode

As a newbie with tidyverse, i need to ask. Does this average TPM calculation take into consideration the length of b, where the expression measurements are taken from? I am just thinking that this way we don't actually get an average expression. imagine the case of getting one hit in one region in a and 10 hits in another region.

1
Entering edit mode
3.2 years ago

edit : not really what the op asks. See my other answer.

There is no overlap between a and b in your example..

However this should work (not tested though):

library(GenomicRanges)

a.gr <- GRanges(a[,1],IRanges(a[,2],a[,3]),strand = a[,6])
b.gr <- GRanges(b[,1],IRanges(b[,2],b[,3]),strand = b[,6])

# perform overlap
ov <- findOverlaps( a.gr,b.gr,ignore.strand=F)

# perform the average tpm per region in a.bed
res <- data.frame(tpm = sapply(split(subjectHits(ov),queryHits(ov)),function(x){mean(b[x,5])}))

# merge with a.bed information
res$chr <- a[row.names(res),1] res$start <- a[row.names(res),2]
res$end <- a[row.names(res),3] # plot : for the plotting I guess there are more efficient ways than geom_segment... ggplot(res,aes(x=start,xend=end,y=tpm,yend=tpm))+geom_segment()+facet_grid(~chr)  ADD COMMENT 1 Entering edit mode It's strange there is a bug in the code formatting in biostars. I want to show ov <- findOverlaps( a.gr,b.gr,ignore.strand=F) but it shows only ov <- findOverlapsa.gr,b.gr,ignore.strand=F) . The opening bracket is missing .... ADD REPLY 0 Entering edit mode Happend to me a couple of times ! ADD REPLY 0 Entering edit mode Nicolas Rosewick : That is a known bug with python (and I guess R code). You can put your code in a gist and then link it here to be safe. ADD REPLY 1 Entering edit mode 3.1 years ago Forward-strand: $ awk '($6=="+")' a.bed > a.for.bed$ awk '($6=="+")' b.bed > b.for.bed$ bedmap --echo --mean --delim '\t' a.for.bed b.for.bed > answer.for.bed


Reverse-strand:

$awk '($6=="-")' a.bed > a.rev.bed
$awk '($6=="-")' b.bed > b.rev.bed
$bedmap --echo --mean --delim '\t' a.rev.bed b.rev.bed > answer.rev.bed  Union: $ bedops --everything answer.for.bed answer.rev.bed > answer.bed