Question: Problem With Iranges And Findoverlap
2
gravatar for e.karasmani
7.2 years ago by
e.karasmani120
e.karasmani120 wrote:

Dear All,

I have a small problem with Find overlaps function of IRange. I am trying to make the overlap between my peaks and the intron dataset that I have downloaded from UCSC.

So here is what I am doing....

peak.file =read.table ("/data/.chip=20.txt", header=T, sep='\t')
pr <- RangedData(IRanges(start=peak.file$start, end=peak.file$end), space=peak.file$chromosome,  idx=peak.file$target.gene.name), idx=1:nrow(peak.file)

pr

RangedData with 52445 rows and 1 value column across 21 spaces
         space                 ranges   |        idx
      < factor >              < IRanges >   |   < factor >
1         chr1 [  9933699,   9934385]   |    Snord87
2         chr1 [ 88255056,  88257357]   |        Ncl

then I am doing the same for my introns

intron.file =read.table ("/data//mm9.intron.txt", header=T, sep='\t')
intron <- RangedData(IRanges(start=intron.file$start, end=intron.file$end), space=intron.file$chr, idx=intron.file$name)
intron

intron

RangedData with 254240 rows and 1 value column across 32 spaces
             space                 ranges   |   idx
           < factor>              < IRanges>   |  <factor>
1             chr1 [134221650, 134222782]   |  NM_001195025_intron_1_0_chr1_134221651_ F     
2             chr1 [134222806, 134224273]   | NM_001195025_intron_2_0_chr1_134222807_f

And then I am doing that

> ol=findOverlaps(pr, intron, maxgap=0, minoverlap=250, type = "equal", select = "first")
> ol
CompressedIntegerList of length 21
[["chr1"]]  NA NA N 11538 NA NA ... NA NA NA NA NA<NA>
[["chr10"]] NA NA NA NA NA NA... 4431 NA 4960 NA NA<NA>
[["chr11"]] 1096 NA NA NA NA 9556 ... 6717 6719 NA NA NA 6722
[["chr12"]] 5425 NA 1884 NA NA NA 551 ... NA NA 647 NA NA  NA
[["chr13"]] NA NA N A NA NA NA... NA NA NA NA NA NA
[["chr14"]] NA NA NA NA NA NA... NA NA NA NA 5550  NA
[["chr15"]] NA NA 396 NA NA NA NA... NA NA NA NA NA 2318
[["chr16"]] NA 882 278 NA NA NA NA... NA 3985 1360 NA NA<NA>
[["chr17"]] 4365 NA NA NA NA NA... NA NA NA NA NA <NA>
[["chr18"]] 742 NA NA NA NA NA NA... NA NA NA 2836 NA<NA>

Where you see [HTML] is NA

 > as.matrix(ol)
   Error in as.vector(data) :
   no method for coercing this S4 class to a vector

What am I doing wrong??? Is this what I should I expect????

Thank you in advance

I will wait for your answer

Best regards Lena

R peak-calling chip-seq • 4.2k views
ADD COMMENTlink modified 7.2 years ago by Michael Dondrup46k • written 7.2 years ago by e.karasmani120

you'll have to reformat the post as most of these elements are just [HTML].

ADD REPLYlink written 7.2 years ago by Arun2.3k

that's a bug in the biostar renderer I believe.

ADD REPLYlink written 7.2 years ago by Michael Dondrup46k

This seems to be more of an R and package problem. It would be helpful to know what version of IRanges, bioconductor and R you are using.

ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by Arun2.3k

sorry for the late answer but I was sick....here it is

 sessionInfo ()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] GenomicRanges_1.6.7 IRanges_1.12.6
ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by e.karasmani120
1
gravatar for Michael Dondrup
7.2 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I can't find anything wrong here. The < NA > comes from when there is no overlap found. If you are surprised not to find more overlaps, then you should adjust your overlap parameters. Asking for identical ranges with only +-250 bp tolerance is the problem. Remember that your peaks if they come from ChIP-seq experiments represent a location of a protein binding that is not very likely to span a whole intron. I suggest you use type='any' instead, in a ChIP-seq the accuracy is not high enough to justify to be more restrictive. If you want a matrix, you have to use select='all'. and call as.matrix() you can most likely ignore multiple overlaps, because there won't be many overlapping introns.

Look at this simple example to see that all is fine:

example(RangedData) # to get some data, ignore the output
rd
RangedData with 3 rows and 1 value column across 2 spaces
     space    ranges |     score
  < factor> < IRanges> | < integer>
1        1    [1, 4] |        10
2        1    [3, 6] |        NA
3        2    [2, 5] |         2
rd2
RangedData with 3 rows and 1 value column across 1 space
 space    ranges |     score
 < factor> <  IRanges> | < integer>

1        1    [1, 4] |        10
2        1    [2, 5] |         2
3        1    [3, 6] |        NA

ol=findOverlaps(rd, rd2, maxgap=0, minoverlap=1, type = "equal", select = "first")
ol
CompressedIntegerList of length 2
[["1"]] 1 3
[["2"]] < NA >
ol=findOverlaps(rd, rd2, maxgap=0, minoverlap=1, type = "any", select = "all")
ol
RangesMatchingList of length 2
names(2): 1 2
olm = as.matrix(ol)
olm
     query subject
[1,]     1       1
[2,]     1       2
[3,]     1       3
[4,]     2       1
[5,]     2       2
[6,]     2       3

 ## if you want to get rid of multiple matches you can do it now:
 olm[!duplicated(olm[,1]),]
 query subject
 [1,]     1       1
 [2,]     2       1
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Michael Dondrup46k

Hi Michael, I thought the problem was this error part? Error in as.vector(data) : no method for coercing this S4 class to a vector

I remember IRanges to previously use matchMatrix to get the overlapping matrix from findOverlaps. And in the new version of IRanges, its now as.matrix. However, if there is even no overlaps (which is not in her case), you should be able to obtain the matrix with as.matrix() (even if its NULL). I had some issues with bioconductor 2.10 and R version < 2.15. I wonder if this is the case.

ADD REPLYlink written 7.2 years ago by Arun2.3k

I don't remember if there had been a change recently. I am using 2.14.1. In this version, I cannot call as.matrix on a compressedIntergerList with the same error message, but when I use select="all" I get a RangesMatchingList which can be converted into a matrix. I will extend my example.

ADD REPLYlink written 7.2 years ago by Michael Dondrup46k

thank you for your answer. I will try your suggestions with "type=any" However could you please explain to me what exactly maxgap and minoverlap do? For example I have a peak

     space          ranges   |       
      < factor >  < IRanges >   |   < factor >
1         chr1 [  1000,  5000]   |    Snord87
2         chr1 [ 7000,  10000]   |        Ncl

and I am comparing that to introns

     space            ranges   |        idx
      < factor >   < IRanges >   |   < factor >
1         chr1 [  5000,   5500]   |   intron1
2         chr1 [ 9000,  12000]   |   intron2

so i am saying

ol=findOverlaps(peak, intron, maxgap=......, minoverlap=....., type = "equal", select = "first")

I say equal because I would like to count from the middle of the peak and first because I want to have the overlap of my peaks with at least one intron.

So my problem is.....what maxgap and minoverlap should I use?

If I understand correctly.....maxgap means the distance between the two peaks????

minoverlap means how much at least these peaks should be covered? for example for the second peak if i say minoverlap 1000 then i will find it back as an overlap.....but if i say minoverlap 2000 then I will not have this peak????

are my assumptions correct?

thank you in advance

best regards Lena

ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by e.karasmani120

The choice of these parameters depends on what you are actually trying to achieve and your data. Could you maybe try to define what you are looking for, an example: "Give me all peaks overlapping with an intron by at least 250bp (best to explain why you choose this number) AND also overlaps the intron midpoint". Try to specify these constraints carefully and try to motivate restrictions from biology. If you cannot find a good reason to restrict the overlaps, I would simply relax contraints to "Find any peak not further away than n bases from at least 1 intron".

ADD REPLYlink written 7.2 years ago by Michael Dondrup46k

Regarding maxgap, minoverlap, did you read the documentation ?findOverlaps? What exactly is unclear? maxgap means what is the maximum gap between a peak and an intron in your case, not between two peaks.

ADD REPLYlink written 7.2 years ago by Michael Dondrup46k

Dear Michael, thanks for the answer. I have read the documentation but the minoverlap function is not clear to me. another thing that i thought is if i select type equal (which means that I measure from the center of my peak) then the function looks for the overlap from that position based on maxgap and minoverlap parameters......so in that scenario i guess i should have higher values from maxgap and minoverlap than if i was trying to find overlap with type start or any.....is that correct?

ADD REPLYlink written 7.2 years ago by e.karasmani120

You are right, in case type="equal" then maxgap gets a different meaning. But that has nothing to do with the center of the intervals. If type is "equal" then start AND end coordinates of the interval must be identical at both, start and end, given maxgap=0. Example:

[start, end]

 I1= [100,200],    I2=[100,200],  I3=[110,210]

 ... type="equals", maxgap=0, minoverlap=1 ...
 # I1 overlaps I2 but not I3 (because I1 != I3)
 ... type="equals", maxgap=10, minoverlap=1 ...
 # I1 ov. I2 and I1 overlaps I3, because I3 ==  I1 (+- 10bp) 
... type="equals", maxgap=10, minoverlap=100 ...
 # I1 ov. I2 but not I3 because length of overlap is only 90
 ... type="equals", maxgap=10, minoverlap=90 ...
 # I1 ov. I2 and I1 ov. I3 because length of overlap is 90

If you want to look at midpoints of the intervals instead, then you should calculate them specifically.

ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by Michael Dondrup46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1779 users visited in the last hour