Question

Why does every database have a different set of exons for TP53 (and other genes)?

1

4.8 years ago

mrz132435 ▴ 20

I was looking to get the start and stop positions for all exons in TP53, and of course there are a few ways to do this in R. TP53 is generally described as having 11 exons, and yet each method I use to pull exon info finds a different number: 53 exons with biomaRt, and 21 exons with GenomicFeatures pulling from UCSC (see below for both examples). In fact, what I get from Ensembl through biomaRt doesn't even seem to agree with what is shown on the Ensembl page for TP53. Incidentally, in both Ensembl and UCSC some of the reported exons overlap one another.

So, first of all, why is there so much disagreement over how many/which exons exist in TP53? And secondly, how would I get information (mainly start and stop sites) for the "consensus" exons (in the case of TP53, I guess that'd be the e1 through e11 that most biologists believe in)? Thanks.

Exon info from Ensembl:

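library(biomaRt)
# Connect to the Ensembl human gene mart
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# One row per distinct exon annotated to TP53, across all of its transcripts
gb <- getBM(attributes = c("ensembl_exon_id", "exon_chrom_start", "exon_chrom_end"),
            filters = "hgnc_symbol", values = "TP53", mart = ensembl, bmHeader = TRUE)
nrow(gb)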

From UCSC:

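library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
genome <- TxDb.Hsapiens.UCSC.hg19.knownGene
# TP53 is Entrez gene 7157; gene_id is stored as a character vector
all_genes <- genes(genome)
tp53 <- all_genes[all_genes$gene_id == "7157"]
# Every exon that overlaps the TP53 gene range, from any transcript
tp53_exons <- subsetByOverlaps(exons(genome), tp53)
length(tp53_exons)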

gene exons ensembl R • 1.4k views

updated 2.9 years ago by Ram 44k • written 4.8 years ago by mrz132435 ▴ 20

score 1 · Answer 1 · 2019-08-07

TP53 has multiple transcripts or splice variants. You can visualize that info in Ensembl, for example. Each one has multiple exons and those may not be identical across every transcript.

What you are probably really looking for is the canonical transcript, but that's not an easy question to answer. There are some previous discussions on that topic:

How to tell which transcript is the canonical transcript?
Why the list of genes in UCSC "knownGene" table is strikingly different than the list of genes in UCSC "known canonical" table?
How to get known canonical transcript information from UCSC for a specific gencode version
How does VEP decide on canonical transcripts and is there a list?

score 1 · Answer 2 · 2019-08-07

In short, it's because your method is grabbing exons for multiple transcripts. Try:

library(biomaRt)
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Including the transcript ID shows which transcript each exon belongs to
gb <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_exon_id",
                           "exon_chrom_start", "exon_chrom_end"),
            filters = "hgnc_symbol", values = "TP53", mart = ensembl, bmHeader = TRUE)
nrow(gb)
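To make the point concrete, here is a minimal sketch of how the gene-level exon count relates to the per-transcript counts. It uses a small made-up table shaped like a getBM() result with bmHeader = FALSE (so the columns carry the attribute names); the transcript IDs, exon IDs, and coordinates are invented for illustration and are not real TP53 annotation.

```r
# Toy table shaped like a getBM() result (bmHeader = FALSE).
# All IDs and coordinates are hypothetical, not real TP53 annotation.
gb_toy <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T1", "T2", "T2", "T3", "T3", "T3"),
  ensembl_exon_id       = c("E1", "E2", "E3", "E1", "E4", "E2", "E3", "E5"),
  exon_chrom_start      = c(100, 300, 500, 100, 320, 300, 500, 700),
  exon_chrom_end        = c(200, 400, 600, 200, 400, 400, 600, 800),
  stringsAsFactors = FALSE
)

# Distinct exons over the whole gene: this is what a gene-level query counts
n_unique_exons <- length(unique(gb_toy$ensembl_exon_id))

# Exons in each individual transcript
exons_per_tx <- table(gb_toy$ensembl_transcript_id)

print(n_unique_exons)
print(exons_per_tx)
```

In the toy table, exons E2 (300-400) and E4 (320-400) belong to different transcripts and overlap, which is the same kind of overlap described in the question.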

As for snagging the "canonical" transcript, you have a few options. This post does a good job of explaining what canonical really means for the different annotations.
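Once a transcript has been chosen (by whichever definition of "canonical" you settle on), pulling its exons is a matter of filtering on the transcript ID and sorting by exon rank. The sketch below assumes a table with the columns ensembl_transcript_id, rank, exon_chrom_start and exon_chrom_end, modeled on biomaRt attribute names; the helper exons_for_transcript and the toy values are hypothetical.

```r
# Return the exons of one transcript in transcript order (exon 1, 2, 3, ...).
# `gb` is assumed to have the columns ensembl_transcript_id, rank,
# exon_chrom_start and exon_chrom_end. This helper is illustrative only.
exons_for_transcript <- function(gb, transcript_id) {
  tx <- gb[gb$ensembl_transcript_id == transcript_id, , drop = FALSE]
  tx <- tx[order(tx$rank), , drop = FALSE]
  rownames(tx) <- NULL
  tx
}

# Toy data for a reverse-strand gene (TP53 is on the reverse strand),
# so exon 1 has the highest coordinates. Values are invented.
gb_toy <- data.frame(
  ensembl_transcript_id = c("T1", "T1", "T1", "T2", "T2"),
  rank                  = c(2, 1, 3, 1, 2),
  exon_chrom_start      = c(300, 500, 100, 500, 320),
  exon_chrom_end        = c(400, 600, 200, 600, 400),
  stringsAsFactors = FALSE
)

canonical_exons <- exons_for_transcript(gb_toy, "T1")
print(canonical_exons)
```

The transcript ID is passed in rather than looked up, because which transcript counts as canonical depends on the annotation source.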

score 1 · Answer 3 · 2019-08-08

Others have dealt with why there are so many exons. Every database reports different exons because, even for a well-studied gene like p53, we don't really know for sure what all of its exons are or in what combinations they come together to make isoforms.

Different databases report different subsets of all the sequences that have ever been seen to align to the general p53 region, because each database has its own criteria for what makes it into the database.
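One way to see how far two sources disagree is to compare their exon sets by coordinate. This is a rough sketch with invented coordinates; exon_key is a hypothetical helper, and it assumes both sets are on the same genome assembly. That matters for the examples in the question, since the UCSC package there is hg19 while the default Ensembl mart serves the current assembly.

```r
# Build a "start-end" key per exon so two exon tables can be compared.
# Hypothetical helper; assumes both tables use the same genome assembly.
exon_key <- function(df) {
  sprintf("%d-%d", as.integer(df$start), as.integer(df$end))
}

# Invented exon coordinates standing in for two annotation sources
db_a <- data.frame(start = c(100, 300, 500),      end = c(200, 400, 600))
db_b <- data.frame(start = c(100, 320, 500, 700), end = c(200, 400, 600, 800))

shared <- intersect(exon_key(db_a), exon_key(db_b))  # identical in both
only_a <- setdiff(exon_key(db_a), exon_key(db_b))    # reported only by A
only_b <- setdiff(exon_key(db_b), exon_key(db_a))    # reported only by B

print(shared)
print(only_a)
print(only_b)
```

Exact matching is deliberately strict: an exon whose boundary differs by a single base between sources counts as a different exon here.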

In fact, far from the most-studied genes being the most settled, it seems that the harder you look, the more you find, so the best-studied genes probably have the most disagreement between databases.