I would like to understand better the meaning of the indexes for both Granges and Grangeslist objects. I will post here what I think it's going on in this function and please do correct me if I'm wrong.
Let's assume two grange objects:
> g
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 101-113 - | 1 1.000000
b chr2 102-112 + | 2 0.888889
c chr2 103-113 + | 3 0.777778
d chr2 150-200 * | 4 0.666667
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
and
> g2
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 111-120 - | 1 1.000000
b chr2 102-112 + | 2 0.888889
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
If I use findoverlaps like that:
> findOverlaps(g,g2)
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 2 2
[3] 3 2
-------
queryLength: 4 / subjectLength: 2
Here query is g and subject is g2 and the indexes(numbers) of queryhits from this Hits object which are 1 2 3 represent the index in the g object. Is that correct ? So 1 2 3 indexes in this case will be the first 3 rows of g, which are:
> g
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 101-113 - | 1 1.000000
b chr2 102-112 + | 2 0.888889
c chr2 103-113 + | 3 0.777778
For the subjectHits these are the indexes of g2 which the query is hitting to, is that correct ? So the first index of g hits the first index of g2, the second hits the second and the third hits the second again.
If those statements are correct I think I got the idea of those indices in both queryHits and subjectHits columns.
Now things start to get confusing when using a grange object and grangelist, for example this grangelist:
> grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chr2 103-106 + | 5 0.45
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
$txB
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chr1 107-109 + | 3 0.3
[2] chr1 113-115 - | 4 0.5
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
If I use findoverlaps like that:
> findOverlaps(g,grl)
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 2
[2] 2 1
[3] 3 1
-------
queryLength: 4 / subjectLength: 2
I think the queryHits are following the same logic as before, but subjectHits is representing not the row indices of the different objects in the list but rather the list indices, is that correct ? So in this case 2 1 1 means that the first query is hitting a genomic region somewhere in the 2 index of this GrangeList object. This 2 is equal to $txB in this case.
I'm thinking about the possibilities of building a matrix of hits and missing hits using the genomicRanges package but I'm not sure if this function is appropriate for my use case:
I would like to have something like that when using findoverlaps(g,grl):
queryHits txa txb
<integer> <logical> <logical>
[1] 1 0 1
[2] 2 1 0
[3] 3 1 0
[4] 4 0 0
So here in this hypothetical case we have a logical vectors for both txa and txb objects , there queryhits correspond to the indexes of g and even if there are no hits the index should be included, for example index 4 doesn't hit any regions in txa and txb and so we have 0 0. For the first index it hits txb but not txa.
Any help is much appreciated!!! thanks