overlap regions within a file
1
0
Entering edit mode
3.3 years ago
pt.taklifi ▴ 60

I have a file of ATAC seq narrow peaks. like PEPATAC pipeline here is the algorithm :,First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak.

here is a sample of my data

data <- structure(list(V1 = c("chr3", "chrUn_KI270467v1", "chr8", "chrUn_KI270467v1", 
"chr21", "chr7"), V2 = c(93470281L, 1668L, 109333171L, 1668L, 
14382415L, 12686167L), V3 = c(93470873L, 3946L, 109335050L, 3946L, 
14384230L, 12688127L), V4 = c("Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_60893", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388b", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_92241d", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388c", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_55158c", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_83188b"), V5 = c(91371L, 
24480L, 19002L, 17131L, 17084L, 16639L), V6 = c(".", ".", ".", 
".", ".", "."), V7 = c(726.76721, 240.65007, 195.49055, 179.63454, 
179.23312, 175.41965), V8 = c(9144.74609, 2454.88721, 1906.9408, 
1719.66797, 1714.96497, 1670.38489), V9 = c(9137.11816, 2448.07471, 
1900.24915, 1713.15186, 1708.45618, 1663.93958), V10 = c(272L, 
666L, 1082L, 1445L, 898L, 525L)), row.names = c(88715L, 141209L, 
133771L, 141210L, 80584L, 120831L), class = "data.frame")

I sorted my peaks based on their significance so far I wrote a function

 for ( i in 1:nrow(B1.1)){
      sub<-as_granges(B1.1[i , ], seqnames=1, start=2 , end=3)
      que<- as_granges(B1.1 , seqnames=V1 , start=V2 , end=V3)
      overlaps<-which(overlapsAny( que, sub))
      overlaps<-overlaps[overlaps!=i ]
      B1.1<-B1.1[ ! seq(from=1 , to=nrow(B1.1), by=1)%in%overlaps, ]
      overlaps<-c()
    }

Obviously this code is not very efficient and takes a lot of time .

I would appreciate your suggestions

overlap R ATAC_seq • 1.2k views
ADD COMMENT
1
Entering edit mode
3.3 years ago

You just want to keep the first peak of overlapping peaks within the highest value in V8?

Your example data.

library("plyranges")

gr <- as_granges(data, seqnames=V1, start=V2, end=V3)

> gr
GRanges object with 6 ranges and 7 metadata columns:
              seqnames              ranges strand |                     V4
                 <Rle>           <IRanges>  <Rle> |            <character>
  [1]             chr3   93470281-93470873      * | Bcell-BM4983-150120-..
  [2] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [3]             chr8 109333171-109335050      * | Bcell-BM4983-150120-..
  [4] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [5]            chr21   14382415-14384230      * | Bcell-BM4983-150120-..
  [6]             chr7   12686167-12688127      * | Bcell-BM4983-150120-..
             V5          V6        V7        V8        V9       V10
      <integer> <character> <numeric> <numeric> <numeric> <integer>
  [1]     91371           .   726.767   9144.75   9137.12       272
  [2]     24480           .   240.650   2454.89   2448.07       666
  [3]     19002           .   195.491   1906.94   1900.25      1082
  [4]     17131           .   179.635   1719.67   1713.15      1445
  [5]     17084           .   179.233   1714.96   1708.46       898
  [6]     16639           .   175.420   1670.38   1663.94       525
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Solution.

filtered <- gr %>%
  reduce_ranges %>%
  group_by_overlaps(gr) %>%
  filter(V8 == max(V8))

> filtered
GRanges object with 5 ranges and 8 metadata columns:
Groups: query [5]
              seqnames              ranges strand |                     V4
                 <Rle>           <IRanges>  <Rle> |            <character>
  [1]             chr3   93470281-93470873      * | Bcell-BM4983-150120-..
  [2] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [3]             chr8 109333171-109335050      * | Bcell-BM4983-150120-..
  [4]            chr21   14382415-14384230      * | Bcell-BM4983-150120-..
  [5]             chr7   12686167-12688127      * | Bcell-BM4983-150120-..
             V5          V6        V7        V8        V9       V10     query
      <integer> <character> <numeric> <numeric> <numeric> <integer> <integer>
  [1]     91371           .   726.767   9144.75   9137.12       272         1
  [2]     24480           .   240.650   2454.89   2448.07       666         2
  [3]     19002           .   195.491   1906.94   1900.25      1082         3
  [4]     17084           .   179.233   1714.96   1708.46       898         4
  [5]     16639           .   175.420   1670.38   1663.94       525         5
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

yes exactly . can you please explain your code? thank you

ADD REPLY
1
Entering edit mode

reduce_ranges merges all overlapping ranges into consensus ranges. group_by_overlaps then groups each peak in the original data based on what merged peak (from the previous step) it overlaps. slice then selects the first peak in these groups of peaks.

ADD REPLY
0
Entering edit mode

well actually this is not exactly what I'm looking for. here is the algorithm : First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak.

ADD REPLY
0
Entering edit mode

I modified the code so that it keeps only the peak with the largest value in the V8 columns, which is what I believe your code was pointing to.

ADD REPLY
0
Entering edit mode

I'm not sure how reduce_ranges work. but I assume that if I have 3 ranges "1" , "2" , "3" and if "1" and "2" are overlapping and "2" and "3" are overlapping but "1" and "3" don't overlap, it still gives a consensus region merging all 3 ranges. but what I wanna do is different. in the case of this example first I order my ranges by their score(decreasing=TRUE) in the first iteration I want to only delete range "2" and keep range "3" . next I want to search for all overlapping ranges with the next most significant range which is range "3" and I will delete only ranges that directly overlap range "3". I'm sorry if I couldn’t explain more clearly.

ADD REPLY

Login before adding your answer.

Traffic: 2585 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6