Question

Cluster the intervals but keep strand and return the average score

9.7 years ago

ifreecell ▴ 220

Hi, I have a file in the following format:

chr1     33711     33712     +     3.29
chr1     33712     33713     +     3.31
chr1     33713     33714     +     3.33
chr1     33714     33715     +     3.34
chr1     33715     33716     +     3.33
chr1     33716     33717     +     3.32

This file is not very compact, so I want to collapse it into something like:

chr1     33711     33717     +     3.32

I tried clustering the intervals using Galaxy, but it returned only the first three columns:

chr1     33711     33716

I really need to keep the strand and score columns, because I will later sort the file by score. Is there a script or tool that can do this? Ideally it would have an option to return the mean score of each merged range in the fifth column.

Here is a sample file for testing.

Bed wig

score 2 · Accepted Answer · 2014-08-14

Here's an R solution (I've made it a bit longer than needed so that it's easier to follow), though you could just as well iterate over the intervals in Python or Perl.

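library(rtracklayer)
foo <- import.bed("S2_RF25_2.54._score.bed")
# Merge overlapping/adjacent positions; different strands stay separate
foo2 <- reduce(foo)
# Map each original interval to the merged interval that contains it
o <- findOverlaps(foo, foo2)
# Group the original scores by merged interval and average them
scores <- split(foo$score[queryHits(o)], subjectHits(o))
# Store the per-cluster mean in the score column
foo2$score <- unlist(lapply(scores, mean))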

foo2 can then be exported (e.g. with rtracklayer's export()). If you want positions separated by a larger gap to be merged as well, adjust the min.gapwidth argument of reduce().
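
For anyone who would rather go the "just iterate over things in Python" route, here is a minimal sketch of the same idea. It assumes the 5-column layout from the question (chrom, start, end, strand, score, whitespace-separated) rather than standard BED column order. The names merge_intervals, parse_line and max_gap are made up for this example and don't come from any library.

```python
from itertools import groupby


def parse_line(line):
    """Parse one 'chrom start end strand score' line (layout assumed from the question)."""
    chrom, start, end, strand, score = line.split()
    return chrom, int(start), int(end), strand, float(score)


def merge_intervals(rows, max_gap=0):
    """Merge touching/overlapping intervals per (chrom, strand) and average their scores.

    rows: iterable of (chrom, start, end, strand, score) tuples.
    max_gap: largest gap between two intervals that still get merged
             (0 means they must touch or overlap).
    """
    # Sort so intervals sharing chromosome and strand are contiguous and ordered by start.
    ordered = sorted(rows, key=lambda r: (r[0], r[3], r[1], r[2]))
    merged = []
    for (chrom, strand), group in groupby(ordered, key=lambda r: (r[0], r[3])):
        cur_start = cur_end = None
        scores = []
        for _, start, end, _, score in group:
            if cur_start is not None and start - cur_end <= max_gap:
                # Close enough to the current cluster: extend it.
                cur_end = max(cur_end, end)
                scores.append(score)
            else:
                # Emit the finished cluster (if any) and start a new one.
                if cur_start is not None:
                    merged.append((chrom, cur_start, cur_end, strand, sum(scores) / len(scores)))
                cur_start, cur_end, scores = start, end, [score]
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end, strand, sum(scores) / len(scores)))
    return merged


example = """\
chr1     33711     33712     +     3.29
chr1     33712     33713     +     3.31
chr1     33713     33714     +     3.33
chr1     33714     33715     +     3.34
chr1     33715     33716     +     3.33
chr1     33716     33717     +     3.32
"""

rows = [parse_line(line) for line in example.splitlines() if line.strip()]
result = merge_intervals(rows)
for chrom, start, end, strand, mean_score in result:
    print(f"{chrom}\t{start}\t{end}\t{strand}\t{mean_score:.2f}")
```

On the six lines from the question this prints a single merged line, chr1 33711 33717 + 3.32. The output is sorted by chromosome, strand, then start, so re-sort by the last column afterwards if you need it ordered by score. Raising max_gap plays roughly the same role as allowing a larger gap in reduce().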