Generic Bioconductor Object To Integrate Different Genomics Data Formats/Platforms
11.5 years ago
Irsan ★ 7.8k

As a young bioinformatician, I have started to notice that, although genomics data comes in an immensely wide variety of formats, once you have successfully pre-processed it, most of it boils down to data about genomic locations.

For example, consider a copy number analysis study on tumor samples from breast cancer patients. The data could come from aCGH, SNP arrays or whole-genome sequencing, on platforms such as Affymetrix, NimbleGen, Illumina and ABI SOLiD. After processing, irrespective of the platform, you end up with log R ratios (LRR) at genomic positions, which allows you to integrate the various platforms.

My question is: what is the commonly used, community-accepted (preferably Bioconductor) data object that can store genomic positional data and is accepted by many downstream analysis programs? For the copy number study, for example, you would like to put the log R ratios at genomic positions across the genome, for all tumor samples from the different platforms, into one data object. Then you would like to draw a karyogram of the LRR values of each sample as a heatmap or dot plot, make a frequency plot of the segmentations of all samples to identify common CNV segments, and do hierarchical clustering on the segmentations of all samples to identify subgroups in the patient cohort.

I have noticed that the GRanges object from GenomicRanges might serve as a flexible container for data on genomic positions. For example, the ggbio package accepts a GRanges object to draw karyograms. But I am unaware of other packages that support GRanges/RangedData objects. (So is this the thing I am looking for?)

Ideally, I would like to find a central Bioconductor object (is that GRanges?) that flexibly stores data on genomic positions (copy number, B allele frequency, methylation, transcription factor binding sites, expression, etc.) and is supported by many downstream analysis packages that do

  • ideogram/karyogram visualization (maybe even along with expression or methylation data)
  • hierarchical clustering
  • finding overlaps of genomic positions between samples / make frequency plots
  • perform gene set enrichment analysis
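
To make the "frequency plot of segmentations" idea concrete, here is a minimal, library-free Python sketch. It is not the Bioconductor API (there, one would typically hold the segments in GRanges objects and use findOverlaps from GenomicRanges). The function name, the tuple layout, the example segments and the gain threshold of 0.2 are all assumptions made for illustration.

```python
def gain_frequency(bins, segments_by_sample, threshold=0.2):
    """Fraction of samples with a gained segment overlapping each bin.

    bins: list of (chrom, start, end), 1-based inclusive coordinates.
    segments_by_sample: sample name -> list of (chrom, start, end, mean_lrr).
    threshold: hypothetical cutoff on the segment mean log R ratio for a gain.
    """
    freqs = []
    for chrom, b_start, b_end in bins:
        n_hit = 0
        for segments in segments_by_sample.values():
            # Two closed intervals overlap when each starts before the other ends.
            if any(c == chrom and s <= b_end and b_start <= e and mean > threshold
                   for c, s, e, mean in segments):
                n_hit += 1
        freqs.append(n_hit / len(segments_by_sample))
    return freqs


# Made-up bins and segments, purely for illustration.
bins = [("chr1", 1, 1000), ("chr1", 1001, 2000), ("chr2", 1, 1000)]
segments_by_sample = {
    "tumor_A": [("chr1", 1, 1500, 0.6), ("chr2", 1, 1000, -0.1)],
    "tumor_B": [("chr1", 1200, 2000, 0.5), ("chr2", 1, 1000, 0.4)],
}
freq = gain_frequency(bins, segments_by_sample)
print(freq)
```

The resulting per-bin fractions are what a frequency plot would show along the genome; the same per-bin values, arranged as one column per sample, could also feed hierarchical clustering.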
bioconductor • 3.3k views
11.5 years ago

The GenomicRanges package is the right place to look.

There is also some effort to make the SummarizedExperiment class in the GenomicRanges package this "universal container of assay data over genomic ranges," but I'm not sure you'll find any one data structure that plugs into all the analyses you are after, as all of this is relatively new.
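
As a rough illustration of what a "container of assay data over genomic ranges" means, here is a small Python sketch of the idea: one row per genomic range, one column per sample, and any number of named assay matrices sharing that layout. This is not the SummarizedExperiment API; the class names, method names and values below are hypothetical and chosen only to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GenomicRange:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class RangedAssays:
    """Hypothetical container: ranges are rows, samples are columns."""
    ranges: list
    samples: list
    assays: dict = field(default_factory=dict)  # assay name -> rows x samples

    def add_assay(self, name, matrix):
        # Every assay must line up with the shared ranges and samples.
        if len(matrix) != len(self.ranges):
            raise ValueError("need one row per range")
        if any(len(row) != len(self.samples) for row in matrix):
            raise ValueError("need one column per sample")
        self.assays[name] = matrix

    def subset_by_overlap(self, query):
        # Keep the rows whose range overlaps the query, in every assay.
        keep = [i for i, r in enumerate(self.ranges)
                if r.chrom == query.chrom
                and r.start <= query.end and query.start <= r.end]
        out = RangedAssays([self.ranges[i] for i in keep], list(self.samples))
        for name, matrix in self.assays.items():
            out.assays[name] = [matrix[i] for i in keep]
        return out


# Made-up bins and log R ratios for two samples.
bins = [GenomicRange("chr1", 1, 1000),
        GenomicRange("chr1", 1001, 2000),
        GenomicRange("chr2", 1, 1000)]
se = RangedAssays(bins, ["tumor_A", "tumor_B"])
se.add_assay("log_r_ratio", [[0.1, -0.4], [0.8, 0.0], [-0.2, 0.3]])
hits = se.subset_by_overlap(GenomicRange("chr1", 900, 1100))
print(hits.ranges, hits.assays["log_r_ratio"])
```

The point of the design is that subsetting by genomic position keeps all assays and the sample list in sync, so copy number, methylation or expression matrices measured over the same ranges can travel together through an analysis.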

In addition to the ggbio package for visualization, you might want to explore if the Gviz package has any functionality you would find helpful.


Hi Steve, thanks for your suggestions about the SummarizedExperiment class and the Gviz package; I will look into them. BTW, I just saw that you can also run segmentation (similar in purpose to CBS) on GenomicRanges objects with a new Bioconductor package called "fastseg".


For CGH data, you might also check out the genoset package.
