Whenever I annotate microarray probes or RNASeq reads and want gene-level information, I run into the following problem: for a "clean" annotation, I don't want to consider any reads/probes that map to transcripts of more than one gene, and for "clean" statistics I don't want to count any probe/value more than once in the analysis. To achieve this, it is essential to work with a non-redundant database. I usually use RefSeq, because it is the database I am most familiar with.
In RefSeq, however, I found many exons (about 1500, downloaded from the UCSC refGene track) that are shared by transcripts with different gene symbols. About 500 "genes" seem to be affected by this kind of overlap.
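For anyone who wants to reproduce this kind of count, here is a minimal sketch in Python. It assumes the standard tab-separated layout of the UCSC refGene table dump (column 2 = transcript name, 3 = chrom, 4 = strand, 10 = exonStarts, 11 = exonEnds, 13 = name2, i.e. the gene symbol), and it assumes that "shared" means identical exon coordinates on the same strand; counting partial overlaps would need interval logic instead. The function names are made up for illustration.

```python
from collections import defaultdict

def parse_genepred(lines):
    """Yield (transcript, chrom, strand, exons, gene) from refGene-style rows.

    Assumes the UCSC table layout with a leading 'bin' column.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and an optional header line
        f = line.rstrip("\n").split("\t")
        starts = [int(x) for x in f[9].split(",") if x]
        ends = [int(x) for x in f[10].split(",") if x]
        yield f[1], f[2], f[3], list(zip(starts, ends)), f[12]

def exons_shared_between_genes(lines):
    """Map each exon (chrom, strand, start, end) to the gene symbols using it,
    keeping only exons that occur under more than one symbol."""
    genes_by_exon = defaultdict(set)
    for _tx, chrom, strand, exons, gene in parse_genepred(lines):
        for start, end in exons:
            genes_by_exon[(chrom, strand, start, end)].add(gene)
    return {exon: genes for exon, genes in genes_by_exon.items() if len(genes) > 1}
```

Usage would be something like `shared = exons_shared_between_genes(open("refGene.txt"))`; `len(shared)` then gives the number of shared exons, and the union of the value sets gives the affected gene symbols. Since the ensGene table has the same layout with the gene ID in name2, the same sketch should work there as well.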
Here are two extreme examples:
Duxbl1, Duxbl2 and Duxbl3 share all their exon junctions. Their ORFs (NP_001171009.1, NP_001171010.1, NP_899245.1) are identical.
Il11ra2 shares all its exon junctions with Gm2002 and Gm13305. Their ORFs (NP_001094066.1, NP_034680.3, NP_001092818.1) are identical.
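Cases like these two, where different gene symbols share every exon junction, can be found with a sketch along the following lines. Again this assumes the UCSC refGene column layout (exonStarts in column 10, exonEnds in column 11, gene symbol in column 13, exons listed in ascending order); the function names are hypothetical. It compares only the junction chain, so transcripts that differ in their first exon start or last exon end (e.g. different UTR lengths) are still grouped together.

```python
from collections import defaultdict

def junction_chain(exon_starts, exon_ends):
    """Return the ordered exon junctions of one transcript as
    (end of exon i, start of exon i+1) pairs."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    return tuple(zip(ends[:-1], starts[1:]))

def genes_sharing_junction_chain(lines):
    """Group gene symbols whose transcripts have an identical junction chain
    on the same chromosome and strand; return groups with >1 symbol."""
    genes_by_chain = defaultdict(set)
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(f) < 13:
            continue  # skip header or malformed rows
        chain = junction_chain(f[9], f[10])
        if chain:  # single-exon transcripts have no junctions to compare
            genes_by_chain[(f[2], f[3], chain)].add(f[12])
    return [genes for genes in genes_by_chain.values() if len(genes) > 1]
```

Whether the ORFs are also identical would have to be checked separately, for example by comparing the protein sequences of the corresponding NP_ accessions.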
Some of the genes that I found differ in their UTRs due to alternative transcription start sites or poly-A sites. Some also differ in their ORFs due to alternative splicing events. On the other hand, RefSeq contains many transcripts that have the same gene symbol but differ in their ORFs and UTRs (e.g. Nfkbid).
In Ensembl, only about 900 exons (downloaded from the UCSC ensGene track) are shared by transcripts with different ENSG IDs, which still affects about 500 genes. Ensembl states:
"An Ensembl gene (with a unique ENSG... ID) includes any spliced transcripts (ENST...) with overlapping coding sequence. (...) Transcript clusters with no overlapping coding sequence are annotated as separate genes."
This sounds very reasonable, but here, too, I found inconsistencies: Palm2 (ENSMUSG00000090053) and Gm20459 (ENSMUSG00000089945) share coding exons. And what happens when a gene has both coding and non-coding isoforms?
My questions are:
(1) Would you recommend Ensembl over RefSeq to avoid/minimize problems of redundancy? (2) What is the most widely accepted definition of a gene when it comes to "gene-level" information for RNASeq and microarray data?
Thank you for your ideas and advice!