Question

overlap regions within a file

0

Entering edit mode

4.6 years ago

pt.taklifi ▴ 70

I have a file of ATAC seq narrow peaks. like PEPATAC pipeline here is the algorithm :,First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak.

here is a sample of my data

data <- structure(list(V1 = c("chr3", "chrUn_KI270467v1", "chr8", "chrUn_KI270467v1", 
"chr21", "chr7"), V2 = c(93470281L, 1668L, 109333171L, 1668L, 
14382415L, 12686167L), V3 = c(93470873L, 3946L, 109335050L, 3946L, 
14384230L, 12688127L), V4 = c("Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_60893", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388b", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_92241d", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_97388c", "Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_55158c", 
"Bcell-BM4983-150120-RepA.pe.q10.sort.rmdup_peak_83188b"), V5 = c(91371L, 
24480L, 19002L, 17131L, 17084L, 16639L), V6 = c(".", ".", ".", 
".", ".", "."), V7 = c(726.76721, 240.65007, 195.49055, 179.63454, 
179.23312, 175.41965), V8 = c(9144.74609, 2454.88721, 1906.9408, 
1719.66797, 1714.96497, 1670.38489), V9 = c(9137.11816, 2448.07471, 
1900.24915, 1713.15186, 1708.45618, 1663.93958), V10 = c(272L, 
666L, 1082L, 1445L, 898L, 525L)), row.names = c(88715L, 141209L, 
133771L, 141210L, 80584L, 120831L), class = "data.frame")

I sorted my peaks based on their significance so far I wrote a function

 for ( i in 1:nrow(B1.1)){
      sub<-as_granges(B1.1[i , ], seqnames=1, start=2 , end=3)
      que<- as_granges(B1.1 , seqnames=V1 , start=V2 , end=V3)
      overlaps<-which(overlapsAny( que, sub))
      overlaps<-overlaps[overlaps!=i ]
      B1.1<-B1.1[ ! seq(from=1 , to=nrow(B1.1), by=1)%in%overlaps, ]
      overlaps<-c()
    }

Obviously this code is not very efficient and takes a lot of time .

I would appreciate your suggestions

overlap R ATAC_seq • 1.6k views

ADD COMMENT • link 4.6 years ago by pt.taklifi ▴ 70

score 1 · Answer 1 · 2020-12-16

You just want to keep the first peak of overlapping peaks within the highest value in V8?

Your example data.

library("plyranges")

gr <- as_granges(data, seqnames=V1, start=V2, end=V3)

> gr
GRanges object with 6 ranges and 7 metadata columns:
              seqnames              ranges strand |                     V4
                 <Rle>           <IRanges>  <Rle> |            <character>
  [1]             chr3   93470281-93470873      * | Bcell-BM4983-150120-..
  [2] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [3]             chr8 109333171-109335050      * | Bcell-BM4983-150120-..
  [4] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [5]            chr21   14382415-14384230      * | Bcell-BM4983-150120-..
  [6]             chr7   12686167-12688127      * | Bcell-BM4983-150120-..
             V5          V6        V7        V8        V9       V10
      <integer> <character> <numeric> <numeric> <numeric> <integer>
  [1]     91371           .   726.767   9144.75   9137.12       272
  [2]     24480           .   240.650   2454.89   2448.07       666
  [3]     19002           .   195.491   1906.94   1900.25      1082
  [4]     17131           .   179.635   1719.67   1713.15      1445
  [5]     17084           .   179.233   1714.96   1708.46       898
  [6]     16639           .   175.420   1670.38   1663.94       525
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Solution.

filtered <- gr %>%
  reduce_ranges %>%
  group_by_overlaps(gr) %>%
  filter(V8 == max(V8))

> filtered
GRanges object with 5 ranges and 8 metadata columns:
Groups: query [5]
              seqnames              ranges strand |                     V4
                 <Rle>           <IRanges>  <Rle> |            <character>
  [1]             chr3   93470281-93470873      * | Bcell-BM4983-150120-..
  [2] chrUn_KI270467v1           1668-3946      * | Bcell-BM4983-150120-..
  [3]             chr8 109333171-109335050      * | Bcell-BM4983-150120-..
  [4]            chr21   14382415-14384230      * | Bcell-BM4983-150120-..
  [5]             chr7   12686167-12688127      * | Bcell-BM4983-150120-..
             V5          V6        V7        V8        V9       V10     query
      <integer> <character> <numeric> <numeric> <numeric> <integer> <integer>
  [1]     91371           .   726.767   9144.75   9137.12       272         1
  [2]     24480           .   240.650   2454.89   2448.07       666         2
  [3]     19002           .   195.491   1906.94   1900.25      1082         3
  [4]     17084           .   179.233   1714.96   1708.46       898         4
  [5]     16639           .   175.420   1670.38   1663.94       525         5
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths