Exons Associated With Multiple "Genes" In UCSC Genome Data
9.3 years ago
Max ▴ 140

When I retrieve a complete exon list from the UCSC Genome RefSeq genes (human reference genome), I often find a single exon (same coordinates, defined by chromosome number and start/end nucleotides) listed multiple times, i.e. with multiple genes defined by NM_ numbers. Is this to be expected? I realize that there are instances of exons that are shared across multiple "genes," but there seem to be far too many instances of this in the sequence list and data tables to be due to actual shared exons alone.

9.3 years ago
Geparada ★ 1.5k

Hi Max,

First of all, the "NM_xxx" identifiers are transcript annotations, not genes. One gene can have multiple annotated transcripts (isoforms) due to alternative splicing. So yes, it is totally expected that if you extract the exons from any mammalian transcript annotation (RefSeq, UCSC Genes, Gencode), you will have exons listed multiple times.

Now, the number of times each exon is listed MUST equal the number of transcripts that contain that exon. I recommend that you pick a particular exon and compare the number of times it is listed with the number of transcripts that contain it (you can do this just by viewing the genome browser at the exon coordinates). If the numbers don't match, it means that your method for extracting exons from the transcript annotations has a bug.
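If it helps, here is a minimal Python sketch of that check. It assumes your data is a tab-separated dump in the refGene table layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), with exonStarts and exonEnds as comma-separated lists. The function name `exon_to_transcripts` and the toy rows (NM_TOY1, GENEA, etc.) are made up for illustration, so adjust the column indices if your export looks different.

```python
from collections import defaultdict


def exon_to_transcripts(lines):
    """Map each exon (chrom, start, end, strand) to the set of transcripts containing it.

    Assumes refGene-style columns: name at index 1, chrom at 2, strand at 3,
    exonStarts at 9 and exonEnds at 10 (comma-separated, trailing comma allowed).
    """
    exons = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        name, chrom, strand = fields[1], fields[2], fields[3]
        starts = [int(x) for x in fields[9].split(",") if x]
        ends = [int(x) for x in fields[10].split(",") if x]
        for start, end in zip(starts, ends):
            exons[(chrom, start, end, strand)].add(name)
    return exons


# Toy rows with invented identifiers: two isoforms of one gene share an exon,
# and a third transcript from another gene shares none.
example_lines = [
    "0\tNM_TOY1\tchr1\t+\t100\t500\t100\t500\t2\t100,300,\t200,500,\t0\tGENEA\tcmpl\tcmpl\t0,0,",
    "0\tNM_TOY2\tchr1\t+\t100\t700\t100\t700\t2\t100,600,\t200,700,\t0\tGENEA\tcmpl\tcmpl\t0,0,",
    "0\tNM_TOY3\tchr2\t-\t50\t90\t50\t90\t1\t50,\t90,\t0\tGENEB\tcmpl\tcmpl\t0,",
]

exons = exon_to_transcripts(example_lines)

# Report every exon that appears in more than one transcript.
for (chrom, start, end, strand), names in sorted(exons.items()):
    if len(names) > 1:
        print(chrom, start, end, strand, len(names), ",".join(sorted(names)))
```

For each exon, the size of the set should equal the number of times that exon shows up in your own list. To see how many of the repeats come from isoforms of the same gene, you could also collect the gene symbol column (name2, index 12 in this layout) alongside the transcript name.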

Thanks. I'll have to check to see if there's a match with the number of transcripts or not.


