Why does every database have a different set of exons for TP53 (and other genes)?
4.8 years ago
mrz132435 ▴ 20

I was looking to get the start and stop positions for all exons in TP53, and of course there are a few ways to do this in R. It is generally believed that TP53 has 11 exons, and yet each method I use to pull exon info finds a different number. Using biomaRt, I got 53 exons; using GenomicFeatures to get info from UCSC, 21 exons (see below for both examples). In fact, what I get from pulling the data from Ensembl doesn't even seem to agree with what is shown on the Ensembl page for TP53. Incidentally, in both Ensembl and UCSC some of the reported exons overlap with one another.

So, first of all, why is there so much disagreement over how many/which exons exist in TP53? And secondly, how would I get information (mainly start and stop sites) for the "consensus" exons (in the case of TP53, I guess that'd be the e1 through e11 that most biologists believe in)? Thanks.

Exon info from ensembl:

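library(biomaRt)
# Connect to the Ensembl human gene dataset
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# One row per distinct exon annotated for TP53, pooled across all of its transcripts
gb <- getBM(attributes = c("ensembl_exon_id", "exon_chrom_start", "exon_chrom_end"),
            filters = "hgnc_symbol", values = "TP53", mart = ensembl, bmHeader = TRUE)
nrow(gb)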

From UCSC:

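library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
genome <- TxDb.Hsapiens.UCSC.hg19.knownGene
# TP53 is Entrez gene 7157; gene_id is stored as a character string
all_genes <- genes(genome)
tp53 <- all_genes[all_genes$gene_id == "7157"]
# Every exon in the annotation that overlaps the TP53 locus, from any transcript
tp53_exons <- subsetByOverlaps(exons(genome), tp53)
length(tp53_exons)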
gene exons ensembl R • 1.4k views
4.8 years ago
igor 13k

TP53 has multiple transcripts or splice variants. You can visualize that info in Ensembl, for example. Each one has multiple exons and those may not be identical across every transcript.
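
To make that concrete, here is a minimal base-R sketch of how pooling exons across transcripts inflates the count and produces overlapping "exons". The transcript names and coordinates are invented for illustration and are not real TP53 annotation.

```r
# Toy data: two hypothetical transcripts of one gene (coordinates are made up).
exons <- data.frame(
  transcript = c("tx1", "tx1", "tx1", "tx2", "tx2", "tx2"),
  start      = c(100, 300, 500, 100, 320, 500),
  end        = c(200, 400, 600, 200, 400, 600)
)

# Each transcript on its own has three exons.
per_transcript <- table(exons$transcript)

# Pooled over the gene, identical exons collapse, but the exon with an
# alternative splice site (320-400 vs 300-400) is counted separately.
pooled <- unique(exons[, c("start", "end")])
pooled <- pooled[order(pooled$start), ]

# Pooled exons can overlap each other even though no single transcript
# contains overlapping exons.
overlaps_next <- head(pooled$end, -1) >= tail(pooled$start, -1)

print(per_transcript)
print(nrow(pooled))
print(any(overlaps_next))
```

With real annotation the same effect is much larger, because a gene like TP53 has many annotated transcripts.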

What you are probably really looking for is the canonical transcript, but which transcript counts as canonical is not an easy question to answer. There are some previous discussions on that topic.

4.8 years ago

In short, it's because your method is grabbing exons for multiple transcripts. Try:

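library(biomaRt)
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Adding the gene and transcript IDs shows which transcript each exon row belongs to
gb <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_exon_id",
                           "exon_chrom_start", "exon_chrom_end"),
            filters = "hgnc_symbol", values = "TP53", mart = ensembl, bmHeader = TRUE)
# Now one row per transcript/exon pair, so exons shared between transcripts appear repeatedly
nrow(gb)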

As for snagging the "canonical" transcript, you have a few options. This post does a good job of explaining what canonical really means for the different annotations.
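
Whichever definition of canonical you settle on, once you have a transcript ID the remaining step is to keep only that transcript's exons and number them in the direction of transcription. Here is a rough base-R sketch of that step. The table, its column names, and the helper `exons_for_transcript` are all hypothetical, and the coordinates are invented. TP53 sits on the reverse strand, so the example numbers exons from the highest coordinate down.

```r
# Hypothetical table shaped like a per-transcript exon listing;
# IDs and coordinates are invented for illustration.
gb_toy <- data.frame(
  transcript_id = c("txA", "txA", "txA", "txB", "txB"),
  exon_start    = c(500, 100, 300, 100, 520),
  exon_end      = c(600, 200, 400, 200, 600)
)

# Return the exons of one transcript, numbered in the direction of transcription.
exons_for_transcript <- function(tbl, tx_id, strand = c("+", "-")) {
  strand <- match.arg(strand)
  ex <- tbl[tbl$transcript_id == tx_id, ]
  # On the minus strand, exon 1 is the one with the highest genomic coordinates.
  ex <- ex[order(ex$exon_start, decreasing = (strand == "-")), ]
  ex$exon_number <- seq_len(nrow(ex))
  rownames(ex) <- NULL
  ex
}

canonical <- exons_for_transcript(gb_toy, "txA", strand = "-")
print(canonical)
```

The same filtering would apply to the real `gb` table, using whatever column names your query returned.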

4.8 years ago

Others have dealt with why there are so many exons. Every database reports different exons because, even for a well-studied gene like p53, we don't really know for sure what all of its exons are or in what combinations they come together to make isoforms.

Different databases report different subsets of all the sequences that have ever been seen to align to the general p53 region; each database has its own criteria for what makes it into the database.

In fact, far from the most studied genes being the most settled, it seems that the harder you look, the more you find, so the best-studied genes probably have the most disagreement between databases.
