Briefly, my question is: how can I convert a (samples x features) table of counts by EXON to a table of counts by GENE or TRANSCRIPT (ideally a 'SummarizedExperiment' in R)? I am new to RNA-seq and to bioinformatics generally. I have inherited a standard 'samples by features' table of raw RNA-seq counts from a colleague, and would like to perform a straightforward differential gene expression (DEG) analysis with this file, e.g. with DESeq2 or edgeR, which I have done previously starting from BAM files.
However, the dataset I inherited and am asking about is an Excel file representing an exonic 'table of counts', in which >500 TCGA sample IDs are arrayed in columns and ~240k exons are arrayed in rows (each row contains the GRanges info for the exonic feature it represents: chr; hg19 coordinate start; stop; strand). Raw counts fill the table. My question is: how should I convert this 'table of EXON counts' into a 'table of GENE counts' (or TRANSCRIPT counts)? I do not have access to the original BAM files. There are also overlapping exons (please see example).
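To make the question concrete, here is a minimal sketch of the naive approach I have in mind, on toy data: label each exon row with a gene and sum the raw counts per gene with base R's `rowsum()`. The `gene_id` column and the `TCGA_01`/`TCGA_02` sample names are hypothetical. My real table has only chr/start/stop/strand, so the gene label would first have to be derived from an hg19 annotation. I also suspect that overlapping exons (like the first two rows below) could cause the same reads to be counted twice, so please treat this as an illustration of the idea and not as a validated method.

```r
# Toy stand-in for the exon table: coordinates plus raw counts per sample.
# ASSUMPTION: gene_id is a hypothetical column; the real file lacks it and
# it would have to be assigned from an hg19 gene annotation first.
exons <- data.frame(
  chr     = c("chr1", "chr1", "chr1", "chr2"),
  start   = c(100, 150, 500, 100),
  stop    = c(200, 250, 600, 300),
  strand  = c("+", "+", "+", "-"),
  gene_id = c("geneA", "geneA", "geneA", "geneB"),
  TCGA_01 = c(10, 4, 7, 20),   # hypothetical sample columns
  TCGA_02 = c(12, 6, 9, 18)
)

# Keep only the sample columns as a numeric matrix (exons x samples).
sample_cols <- grep("^TCGA", colnames(exons), value = TRUE)
count_mat   <- as.matrix(exons[, sample_cols])

# rowsum() adds up all rows that share the same group label,
# giving a genes x samples matrix of summed exon counts.
gene_counts <- rowsum(count_mat, group = exons$gene_id)
print(gene_counts)
```

If this kind of summing is defensible, the resulting genes x samples matrix is the shape DESeq2 and edgeR expect as a count matrix; if the overlaps make it invalid, that is exactly the part I need help with.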
My guess is the pros would recommend I treat each exon as a 'feature' and perform a 'differential exon expression' analysis. My concerns with that suggestion are: (a) most R packages require BAMs as input; (b) I am not sure I have the computational or mental bandwidth for a differential analysis of 240k exons across 500 samples. Differential gene/transcript expression is really my goal. I use R for most projects, but can run command-line tools (bedtools etc.). I am totally unversed in Python. I have banged around in the 'TCGA2STAT' R package to get raw counts by gene for my research question, but only 162 samples are available there, and the boss would prefer an analysis of the 500 samples in the behemoth exonic counts file I inherited. As always, your help (and patience with the green people) is appreciated.