Question: What Is A "Gene"? Redundancies In Refseq Versus Ensembl
gravatar for Johanna Schott
7.6 years ago by
Johanna Schott390 wrote:

Whenever I annotate microarray probes or RNASeq reads and want to have information at the gene level, I deal with the following problem: In order to have a "clean" annotation, I don't want to consider any reads/probes that map to transcripts of more than one gene, and for the sake of "clean" statistics I don't want to consider any probes/values more than once in the analysis. To achieve this, it is essential to work with a non-redundant database. I usually use RefSeq, because this is the database that is most familiar to me.

In RefSeq, however, I found many exons (about 1500, downloaded from the UCSC refGene track) that are shared by transcripts with different gene symbols. About 500 "genes" seem to be affected by this kind of overlap.

Here are two extreme examples:

Duxbl1, Duxbl2 and Duxbl3 share all their exon junctions. Their ORFs (NP_001171009.1, NP_001171010.1, NP_899245.1) are identical.

Il11ra2 shares all its exon junctions with Gm2002 and Gm13305. Their ORFs (NP_001094066.1, NP_034680.3, NP_001092818.1) are identical.

Some of the genes that I found differ in their UTRs due to alternative transcription start sites or poly-A sites. Some of the genes also differ in their ORFs due to alternative splicing events. On the other hand, you can find many genes in RefSeq that have the same genesymbol but differ in their ORFs and UTRs (e.g. Nfkbid).

In Ensembl, only about 900 exons (downloaded from the UCSC ensGene track) are shared by transcripts with different ENSG IDs, which still affects about 500 genes. Ensembl states:

"An Ensembl gene (with a unique ENSG... ID) includes any spliced transcripts (ENST...) with overlapping coding sequence. (...) Transcript clusters with no overlapping coding sequence are annotated as separate genes."

This sounds very reasonable, but also here I found examples for inconsistencies: Palm2 (ENSMUSG00000090053) and Gm20459 (ENSMUSG00000089945) share coding exons. And what will happen when a gene has coding and non-coding isoforms?

My questions are:

(1) Would you recommend Ensembl over RefSeq to avoid/minimize problems of redundancy? (2) What is the most accepted definition of a gene when it comes to the question of information at the "gene level" for RNASeq and microaray data?

Thank you for your ideas and advice!


ADD COMMENTlink modified 3.7 years ago by Biostar ♦♦ 20 • written 7.6 years ago by Johanna Schott390
gravatar for Emily_Ensembl
7.5 years ago by
Emily_Ensembl21k wrote:

Hi Johanna

In the majority of cases, there shouldn't be different genes which contain the same exons in Ensembl. In some cases, we have reasons to manually split genes in two – you're right that the documentation doesn't make this clear.

If you have any more examples of this, we'd be grateful if you could email a list to Then we can either let you know why these genes have been purposely separated, or, if this has occurred in error, investigate why and fix it.


ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Emily_Ensembl21k

Thank you, the helpdesk is a good idea. I will post the answer as soon as I have an explanation.

ADD REPLYlink written 7.5 years ago by Johanna Schott390

(Also sent from Ensembl helpdesk)

We haven't gone through every one of your 500, but we had a look at a few examples and found the same pattern, which should explain what's going on.

What's happening here is read-through transcripts. Let's look at the first example on your list: chr10:127759937-127760250(+) ENSMUST00000128247 ENSMUST00000073639 ENSMUSG00000056148 ENSMUSG00000089789

Here's the region of these genes:

Here you can see what look like two distinct genes, Rdh1 (ENSMUSG00000089789) and Rdh9 (ENSMUSG00000056148). Spanning both of those genes is the transcript ENSMUST00000128247. This appears to share a 5' end with Rdh1 and the 3' end with Rdh9.

This transcript has been manually annotated by the Havana project based on biological evidence of its existence. The evidence can be seen here:;g=ENSMUSG00000056148;r=10:127759787-127792694;t=ENSMUST00000128247

However its existence does not mean that Rdh1 and Rdh9 are the same gene. Biologically, we still believe that they are two distinct genes, just transcription occasionally spans these two genes, known as readthrough transcription. For this reason we manually split these into two separate genes (although possibly we need three genes: the 5' gene, the 3' gene and the readthrough).

I agree that it could be clearer in our documentation that this is the case, and we will look into making it more explicit.

Here's the documentation from Havana, who do the manual annotation, on defining genes. There's a section on readthroughs on page 37.

In terms of non-coding overlaps, what we mean is that if just a little bit of the UTRs overlap, we won't consider them to be the same gene. For non-coding transcripts, what we're looking it is full exon sharing, rather than just sharing a small part. Our completely non-coding genes are manually curated (see page 18 of the Havana document above) which means that there's a bit of common sense that goes into it.

In terms of your microarray analysis, one way might be to compare to transcripts rather than genes. This way you can trace your transcripts back into gene, and this will differentiate it bit more.

Get back to me if you have any more questions about this.

ADD REPLYlink written 7.5 years ago by Emily_Ensembl21k

Thank you for this quick and helpful reply to my questions! I was not aware of the read-through problem, and it seems that quite some examples on my list can be explained that way. Another category seems to be mRNAs that have an overlap with targets of nonsense-mediated decay. For example: Lrch4 has isoforms that are classified as protein-coding, e.g. ENSMUST00000031734, and isoforms that are targets of nonsense-mediated decay (NMD), e.g. ENSMUST00000177477. Both have the same ENSMUSG... ID (ENSMUSG00000093445). In addition, Lrch4 shares most of its protein coding exons with Gm20605, that is also a target of NMD, but has a different ENSMUSG... ID. In these cases, I don't see the rule yet.

For everybody else who might read this, I still have to clarify: Not all of the roughly 900 exons that I mentioned in my initial post are in conflict with the apparently a bit simplified statement of Ensembl. Only about 400 exons are shared between coding regions of different genes in the mouse genome. These are the cases Emily is talking about in her answer.

ADD REPLYlink written 7.5 years ago by Johanna Schott390

I've sent your list onto our genebuild team, so hopefully we can get some answers.

ADD REPLYlink written 7.5 years ago by Emily_Ensembl21k
gravatar for DG
7.6 years ago by
DG7.2k wrote:

As long as you are consistent I think you should be fine with either choice. I tend to like Ensembl better because, at least for humans, it is a larger dataset compared to RefSeq (I believe), or especially when compared to the CCDS. It often gives me more that I have to sift through, but I prefer that over missing something. With the overlapping gene level, for RNASeq at least I wouldn't worry too much. The short-read mappers and analysis programs for differential expression shouldn't be too bothered by it. Microarray data may be a different story though.

ADD COMMENTlink written 7.6 years ago by DG7.2k

Thank you for your answer. Perhaps I am a bit over-precise. The dilemma is: I have to accept that reads are assigned to several genes (because of overlaps between genes), although I am not sure which gene they actually come from (especially but not only in the case of a library that was not strand-specific). In addition, these reads would have a higher impact on the analysis than a read that is considered only once. Or I have to discard all reads that are assigned to more than one gene, but then I lose all the reads of genes that are redundant in the database (as for Duxbl1, Duxbl2 and Duxbl3 in my example), because it looks like they come from different genes, but actually they don't. This does not bother most analysis tools, but it bothers me:)

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Johanna Schott390

Given high dynamic of biological annotation data (see revision history for the genes that are of interest to you) I think you worry too much. As for definition of a gene (and problems with such a definition), see this paper by Graziano Pesole:

ADD REPLYlink written 7.5 years ago by Pawel Szczesny3.2k

Thanks for pointing this paper out!

We have one project in the lab where we are really looking for the needle in the haystack, the ONE gene that explains a certain phenotype that we observe. That's why I am so scared of missing something. Probably there is no perfect solution to my problem... I just thought databases should be consistent, no matter how they define a gene.

ADD REPLYlink written 7.5 years ago by Johanna Schott390

I do gene discovery research in rare Mendelian disorders, so I know the feeling of not wanting to miss something. Which is exactly why you don't want to throw things out. Reads that are going to map to multiple transcripts will basically even out. If you are looking for differential expression it is the unique exons that count the most anyway. So there are a few issues. One is biological reality, and you can;t do much about that, the other is a data issue. In cases where the databases are wrong or need revising all we can do is use what we think is the best database/reference and be consistent in our choice.

ADD REPLYlink written 7.5 years ago by DG7.2k

... and, as Emily_Ensembl already indicated, the other thing you can do is report these cases back to the database in question, so they can be corrected .... ;)

ADD REPLYlink written 7.5 years ago by Bert Overduin3.7k

That reads mapping to multiple genes even each other out is part of the problem, isn't it? Imagine two genes that overlap by 70%. One is highly upregulated, the other one is highly downregulated. Only the unique exons will show me that strong difference. If I use all the reads, both genes will look like they change only weakly. I would love to show all analyses both ways (allowing multiple assignments and not), but that does not look good in a paper or my PhD thesis... Because bioinformatics is still rather new to me, I am always very happy about any feedback that I get from more experienced people like you, to make sure that I go after the best possible solution.

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Johanna Schott390

The Cufflinks suite explicitly takes this in to account and does differential expression analysis for a variety of biological entities: CDS, Isoforms, Genes, Transcription Start Sites, etc. You can also use DEXSeq, which does differential exon usage. You can then use that information to drill down in to more complicated cases. It is perfectly normal to do multiple analyses and combine them together in to a coherent picture. In fact, I would argue it is what you SHOULD do in bioinformatics.

ADD REPLYlink written 7.5 years ago by DG7.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1921 users visited in the last hour