[GRanges] How to append query metadata to subjectHits
1
1
Entering edit mode
3.4 years ago
Alewa ▴ 170

Hi all,

I'm I found overlaps between couple of genes and TCGA copy number calls but I've been unsuccessful in appendding metadata (hgnc_symbol) to subjectHits

data look like this; 1. sample of genes of interest

> my.cordinates.hg19.gr
GRanges object with 3 ranges and 3 metadata columns:
      seqnames              ranges strand | hgnc_symbol        band   gene_biotype
         <Rle>           <IRanges>  <Rle> | <character> <character>    <character>
  [1]        8 128747680-128753674      * |         MYC      q24.21 protein_coding
  [2]        5     1253262-1295184      * |        TERT      p15.33 protein_coding
  [3]       17     7565097-7590856      * |        TP53       p13.1 protein_coding
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

TCGA absolute copy calls in Granges;

  > Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities
GRanges object with 159857 ranges and 21 metadata columns:
           seqnames            ranges strand |             Sample Num_Probes    Length Modal_HSCN_1 Modal_HSCN_2 Modal_Total_CN Subclonal_HSCN_a1 Subclonal_HSCN_a2 Cancer_cell_frac_a1 Ccf_ci95_low_a1
              <Rle>         <IRanges>  <Rle> |        <character>  <numeric> <numeric>    <numeric>    <numeric>      <numeric>         <numeric>         <numeric>           <numeric>       <numeric>
       [1]        1    564621-1510801      * | TCGA-3C-AAAU_WHITE         53    946180            1            2              3                 0                 1                0.25         0.08383
       [2]        1  1688192-16142960      * | TCGA-3C-AAAU_WHITE       4180  14454768            1            2              3                 0                 1                0.06         0.02732
       [3]        1 16165661-63472103      * | TCGA-3C-AAAU_WHITE      13591  47306442            1            2              3                 0                 1                0.04         0.00615
       [4]        1 63472868-72759524      * | TCGA-3C-AAAU_WHITE       3196   9286656            1            2              3                 0                 0                0.03         0.00000
       [5]        1 72811904-85632596      * | TCGA-3C-AAAU_WHITE       4069  12820692            1            2              3                 0                 0                0.04         0.00292
       ...      ...               ...    ... .                ...        ...       ...          ...          ...            ...               ...               ...                 ...             ...
  [159853]       21 28284150-39907857      * | TCGA-Z7-A8R6_WHITE       4229  11623707            1            1              2                 0                 0                0.00         0.00000
  [159854]       21 39908195-48084820      * | TCGA-Z7-A8R6_WHITE       3214   8176625            1            1              2                 0                 0                0.05         0.01384
  [159855]       22 16055207-24329711      * | TCGA-Z7-A8R6_WHITE       1878   8274504            1            1              2                 0                 0                0.05         0.00946
  [159856]       22 24402321-44489022      * | TCGA-Z7-A8R6_WHITE       6479  20086701            1            1              2                 0                 0                0.04         0.00644
  [159857]       22 44493823-51199978      * | TCGA-Z7-A8R6_WHITE       2807   6706155            1            1              2                 0                 0                0.06         0.02714
           Ccf_ci95_high_a1 Cancer_cell_frac_a2 Ccf_ci95_low_a2 Ccf_ci95_high_a2       LOH Homozygous_deletion    solution        Type        race              ethnicity
                  <numeric>           <numeric>       <numeric>        <numeric> <numeric>           <numeric> <character> <character> <character>            <character>
       [1]          0.39979                0.74         0.58132          0.89771         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
       [2]          0.08989                0.85         0.81351          0.87773         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
       [3]          0.06643                0.91         0.87122          0.92977         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
       [4]          0.05842                0.03         0.00000          0.05981         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
       [5]          0.06738                0.06         0.02157          0.08639         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
       ...              ...                 ...             ...              ...       ...                 ...         ...         ...         ...                    ...
  [159853]          0.02834                0.00         0.00000          0.02834         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
  [159854]          0.06983                0.01         0.00000          0.03265         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
  [159855]          0.07308                0.05         0.00946          0.07308         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
  [159856]          0.05948                0.04         0.00644          0.05948         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
  [159857]          0.08524                0.06         0.02714          0.08524         0                   0         new        BRCA       WHITE NOT HISPANIC OR LATINO
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths
>

find overlaps between TCGA asbolute copy calls and ancestry- related genes

overlaps.tcga.ancestry.genes <- findOverlaps(query=my.cordinates.hg19.gr, subject=Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities, type = "within") ##complete overlaps

create empty column to append gene names for easy filtering for downstream analysis

Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities$Ancestry_genes <- NA

> str(Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities$Ancestry_genes) ##sanity checks
 logi [1:159857] NA NA NA NA NA NA ...

extract indexes of subject hits

sub_hits.overlaps <- as.data.frame(Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities[subjectHits(overlaps.tcga.ancestry.genes)])
str(sub_hits.overlaps$Ancestry_genes) ##sanity checks

append genes names to corresponding overlap indexes in subject hits

Granges.merged.df.tcga.mastercalls.abs.BRCA.ID.ethnicities[queryHits(overlaps.tcga.ancestry.genes)]$Ancestry_genes <- sub_hits.overlaps[,grepl("hgnc_symbol",names(sub_hits.overlaps))]
Error in do.call(`[<-`, args) : replacement has length zero

Thanks for your help

R granges bioconductor findOverlaps • 4.3k views
ADD COMMENT
3
Entering edit mode
3.4 years ago
Papyrus ★ 2.9k

I think it is better if you use the metadata columns in the object, with the mcols function. This function lets you access the DataFrame containing the metadata inside of the GRanges. For example, these 2 GRanges:

> library(GenomicRanges)
> gr <- GRanges(
+   seqnames = c("chr3", "chr4", "chr2"),
+   ranges = IRanges(start = c(101,103,102), end = c(111,113,112), names = head(letters, 3)),
+   hgnc_symbol = c("MYC","TERT","TP53"))
> gr
GRanges object with 3 ranges and 1 metadata column:
    seqnames    ranges strand | hgnc_symbol
       <Rle> <IRanges>  <Rle> | <character>
  a     chr3   101-111      * |         MYC
  b     chr4   103-113      * |        TERT
  c     chr2   102-112      * |        TP53
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> tcga.calls <- GRanges(
+   seqnames = c("chr2", "chr4", "chr3"),
+   ranges = IRanges(start = c(102,70,500), end = c(105,105,510), names = head(letters, 3)),
+   metadata = c("a","b","c"))
> tcga.calls
GRanges object with 3 ranges and 1 metadata column:
    seqnames    ranges strand |    metadata
       <Rle> <IRanges>  <Rle> | <character>
  a     chr2   102-105      * |           a
  b     chr4    70-105      * |           b
  c     chr3   500-510      * |           c
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

You find the overlaps, and you can check their positions:

> overlaps <- findOverlaps(query = gr, subject = tcga.calls)
> queryHits(overlaps)
[1] 2 3
> subjectHits(overlaps)
[1] 2 1

Now create your new metadata column in the subject:

> mcols(tcga.calls)$Ancestry_genes <- NA
> tcga.calls
GRanges object with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |    metadata Ancestry_genes
       <Rle> <IRanges>  <Rle> | <character>      <logical>
  a     chr2   102-105      * |           a           <NA>
  b     chr4    70-105      * |           b           <NA>
  c     chr3   500-510      * |           c           <NA>
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

And finally assign the matched values:

> mcols(tcga.calls)$Ancestry_genes[subjectHits(overlaps)] <- mcols(gr)$hgnc_symbol[queryHits(overlaps)]
> 
> tcga.calls
GRanges object with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |    metadata Ancestry_genes
       <Rle> <IRanges>  <Rle> | <character>    <character>
  a     chr2   102-105      * |           a           TP53
  b     chr4    70-105      * |           b           TERT
  c     chr3   500-510      * |           c           <NA>
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

Thanks! this was helpful!

ADD REPLY

Login before adding your answer.

Traffic: 1380 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6