Cluster the intervals but keep strand and return the average score
9.7 years ago
ifreecell ▴ 220

Hi, I have a file in the following format:

chr1     33711     33712     +     3.29
chr1     33712     33713     +     3.31
chr1     33713     33714     +     3.33
chr1     33714     33715     +     3.34
chr1     33715     33716     +     3.33
chr1     33716     33717     +     3.32

This format is not very compact, so I want to collapse it into something like

chr1     33711     33717     +     3.32

I tried clustering the intervals using Galaxy, but it returned only the first three columns:

chr1     33711     33716

I really need to keep the strand and score columns, because I will later sort the file by score. Is there a script or tool that can do this? Ideally it would have an option to return the mean score of the merged range in the fifth column.

Here is a sample file waiting to be tested.

Bed wig • 1.9k views
9.7 years ago

Here's an R solution. I've made it a bit longer than needed so it is easier to follow. You could also just iterate over the lines in Python or Perl.

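library(rtracklayer)
# Read the BED file into a GRanges object
foo <- import.bed("S2_RF25_2.54._score.bed")
# Merge overlapping and adjacent positions; reduce() respects strand by default
foo2 <- reduce(foo)
# Map each original range to the merged range that contains it
o <- findOverlaps(foo, foo2)
# Group the original scores by merged range, then average each group
scores <- split(foo$score[queryHits(o)], subjectHits(o))
# Store the means in a column named "score" so that BED export picks it up
foo2$score <- unlist(lapply(scores, mean))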

foo2 can then be exported, for example with export.bed(). If you want to merge across a larger gap, adjust the min.gapwidth argument of reduce().
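
For anyone who prefers the "iterate over things in Python" route mentioned above, here is a minimal sketch. It assumes the five-column, whitespace-separated layout from the question (chrom, start, end, strand, score). The names parse_line and merge_adjacent and the max_gap parameter are made up for illustration and do not come from any library.

```python
def parse_line(line):
    """Split one 'chrom start end strand score' line into typed fields."""
    chrom, start, end, strand, score = line.split()
    return chrom, int(start), int(end), strand, float(score)


def merge_adjacent(records, max_gap=0):
    """Merge touching or overlapping intervals on the same chromosome and strand.

    Returns (chrom, start, end, strand, mean_score) tuples. max_gap is a
    hypothetical knob: intervals separated by at most max_gap bases are
    merged as well.
    """
    merged = []
    cur = None  # running cluster: [chrom, start, end, strand, score_sum, n]
    # Sort so that intervals belonging to one cluster end up consecutive.
    ordered = sorted(records, key=lambda r: (r[0], r[3], r[1]))
    for chrom, start, end, strand, score in ordered:
        same_group = cur and chrom == cur[0] and strand == cur[3]
        if same_group and start - cur[2] <= max_gap:
            # Extend the current cluster and accumulate its score.
            cur[2] = max(cur[2], end)
            cur[4] += score
            cur[5] += 1
        else:
            # Close the previous cluster (if any) and start a new one.
            if cur:
                merged.append((cur[0], cur[1], cur[2], cur[3], cur[4] / cur[5]))
            cur = [chrom, start, end, strand, score, 1]
    if cur:
        merged.append((cur[0], cur[1], cur[2], cur[3], cur[4] / cur[5]))
    return merged


if __name__ == "__main__":
    lines = [
        "chr1 33711 33712 + 3.29",
        "chr1 33712 33713 + 3.31",
        "chr1 33713 33714 + 3.33",
        "chr1 33714 33715 + 3.34",
        "chr1 33715 33716 + 3.33",
        "chr1 33716 33717 + 3.32",
    ]
    for chrom, start, end, strand, mean in merge_adjacent(map(parse_line, lines)):
        print(f"{chrom}\t{start}\t{end}\t{strand}\t{mean:.2f}")
```

For the six rows in the question this prints a single line, chr1 33711 33717 + 3.32. Everything is held in memory and sorted first, so for very large files a streaming version over pre-sorted input would probably be a better fit.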
