Hello!
I was recently working on a ribosome profiling project; essentially, you do RNA-seq, but only on the RNA attached to ribosomes at a given time. The first thing I did was align the data to mitochondrial DNA and ribosomal DNA to filter those two sets of reads out. I then aligned the reads that didn't map to either of those to the mm9 annotation, and from there I used HOMER to quantify the repeats and get my data. For both the alignment and the quantification I used the mm9 UCSC gene annotation GTF file. However, this gives separate data for a number of alternative splicing events, so for Gene A I actually have 3 sets of data, all called Gene A. These 3 sets are very similar and rarely vary in total count. The PROBLEM is that when I look at bioinformatics papers that report this kind of data, Gene A is listed only once, with the graph they got from their results and no mention of isoforms or alternatively spliced exons. What am I missing? Do they average these 3 entries, are they using some sort of GTF that doesn't include them, or am I losing my mind?
I am also curious about "scoring" and whether it is related to my problem. I received data for about 55,000 genes (most of which had multiple splicing events) at varying levels, but I have heard that the data for some of these may not be reliable if they are not scored correctly. How do I fix this, or am I just wrong here?
Please consider using a more concise title.
Most of the time, read counting is done at the exon level, but the counts are summarized at the gene_id level. That is why you see only one count per gene.
How can I summarize the counts at the gene ID level?
With featureCounts there is an option (-g gene_id) to summarize counts at gene level.
Thanks for your help friend!
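If you already have a table with one row per isoform, here is a rough Python sketch of what collapsing it to one value per gene_id could look like. Everything in it is an assumption for illustration: the function names, the tiny GTF snippet and the counts are made up, and it assumes the attribute column holds both gene_id and transcript_id in the usual key "value"; form. It is also a simplification and not how featureCounts works internally: it simply adds up the per-isoform numbers, so a read that was counted for several isoforms of the same gene gets counted several times. If your 3 isoforms of Gene A show nearly identical counts, that is probably happening, and recounting at the gene level is the safer route.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column, e.g. gene_id "GeneA";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attribute string into a dict of key -> value."""
    return dict(_ATTR.findall(field))


def transcript_to_gene(gtf_lines):
    """Build a transcript_id -> gene_id lookup from GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def sum_by_gene(transcript_counts, mapping):
    """Add up per-transcript counts under their gene_id.

    Transcripts missing from the mapping are kept under their own ID.
    """
    totals = defaultdict(int)
    for transcript_id, count in transcript_counts.items():
        totals[mapping.get(transcript_id, transcript_id)] += count
    return dict(totals)


# Hypothetical example data, not taken from any real annotation.
example_gtf = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA_iso1";',
    'chr1\tdemo\texon\t100\t250\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA_iso2";',
    'chr1\tdemo\texon\t150\t250\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA_iso3";',
    'chr2\tdemo\texon\t500\t900\t.\t-\t.\tgene_id "GeneB"; transcript_id "GeneB_iso1";',
]
example_counts = {"GeneA_iso1": 10, "GeneA_iso2": 12, "GeneA_iso3": 8, "GeneB_iso1": 5}

gene_counts = sum_by_gene(example_counts, transcript_to_gene(example_gtf))
print(gene_counts)
```

The lookup is built from the same GTF you quantified with, so the isoform IDs in your count table should line up with it. Anything that does not match is left under its own ID rather than silently dropped.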
Nevermind, I think I figured it out. They just report all three isoforms on one graph and call it a gene_set.
(https://xkcd.com/285/)
https://www.ncbi.nlm.nih.gov/pubmed/29346549 look at the supplemental data table!!!
BELIEVE ME