[R] Granges Findoverlaps Constrain By Id
1
2
Entering edit mode
11.1 years ago
bbio ▴ 90

I have two GRanges and I want to find overlaps between them. However, I only want to find overlaps for ranges that have the same parent_id. My current approach is to use the parent_id as the seqnames (instead of the chromosome) under the assumption that findOverlaps will not find matches between different seqnames and that ranges with the same parent_id are on the same chromosome anyway. Is this a valid approach? Are there any potential problems with this? Is there a better way?

r bioconductor • 3.2k views
ADD COMMENT
1
Entering edit mode
11.1 years ago
Michael 54k

Yes, there are some potential problems. I assume that you are importing the GRanges from an annotation file in GFF format using the rtracklayer package. If you use something different, the concern is most likely also important. From http://www.sequenceontology.org/gff3.shtml:

Parent Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, an so forth. A feature may have multiple parents....

You can try this out by using the example code in the rtracklayer package, function import.gff3:

test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")

## basic import
test <- import(test_gff3)
test

When you look at the output, you will find that it is an RangedData object which can be coerced into a GRanges object by: as (test, "GRanges") if needed. The parent information is in the "Parent" column, it is of type CompressedCharacterList to accomodate multiple parent entries:

> test[,'Parent']
 RangedData with 31 rows and 1 value column across 2 spaces
          space         ranges   |                    Parent
    <factor>      <IRanges>   | <CompressedCharacterList>
1      chr10 [92828, 95504]   |                          
2      chr10 [92828, 95178]   |             GeneID:347688
3      chr10 [92828, 95504]   |             GeneID:347688
4      chr10 [92828, 94054]   |                   872,873
5      chr10 [92997, 94054]   |                   872,873

To copy these entries into the geneName column which is of type character is possibly neither necessary nor useful.

To restrict the ranges by a certain Parent ID you can do the following:

l1 = drop( test[,'Parent'])[[1]]
ind = sapply(l1,function(x, parent){parent %in% x}, 872) # filterfunction, in this example Parent := 872

test[ind,]
RangedData with 6 rows and 10 value columns across 2 spaces
     space         ranges |      source     type     score   strand     phase                     Alias    geneName          ID
  <factor>      <IRanges> |    <factor> <factor> <numeric> <factor> <integer> <CompressedCharacterList> <character> <character>
1    chr10 [92828, 94054] | rtracklayer     exon        NA        -        NA                                    NA          NA
2    chr10 [92997, 94054] | rtracklayer      CDS        NA        -        NA                                    NA          NA
3    chr10 [94555, 94665] | rtracklayer     exon        NA        -        NA                                    NA          NA
4    chr10 [94555, 94615] | rtracklayer      CDS        NA        -        NA                                    NA          NA
5    chr10 [94744, 94852] | rtracklayer     exon        NA        -        NA                                    NA          NA
6    chr10 [95348, 95504] | rtracklayer     exon        NA        -        NA                                    NA          NA
         Name                    Parent
  <character> <CompressedCharacterList>
1          NA                   872,873
2          NA                   872,873
3          NA                       872
4          NA                       872
5          NA                       872
6          NA                       872
ADD COMMENT
0
Entering edit mode

Thank you for your response. However, I already do some filtering on my data such that each range has exactly zero or one parent. Under these circumstances, do you think my method would be appropriate?

ADD REPLY
0
Entering edit mode

I don't see a major problem then, except the handling of multiplicity or missing parent_ids, and that you need to keep track of the sequence name elsewhere. Just theoretically, it should work and the application case, e.g. finding genes where exons or transcripts overlap, might be useful. To be on the safe side, I would like to refer you to the bioconductor mailing list as well: http://bioconductor.org/help/mailing-list/

ADD REPLY

Login before adding your answer.

Traffic: 3741 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6