Exon-Start Same As Exon-End From UCSC knownGene
9.9 years ago
brentp 23k

As an example, take a single row from UCSC knownGene (hg19), selected like this:

SELECT K.cdsStart, K.cdsEnd, K.name, K.exonStarts, K.exonEnds
FROM knownGene AS K, kgXref AS X
WHERE X.kgId = K.name AND K.name = 'uc002imy.2';


The output (with new-lines added so that exonStarts and exonEnds line up):

cdsStart        cdsEnd  name    exonStarts      exonEnds
46103793        46115139        uc002imy.2
46103534,46105837,46106490,46109521,46110051,46110576,46111228,46114216,46115032,46115092,46115124,
46103841,46105876,46106542,46109599,46110107,46110668,46111310,46114291,46115092,46115122,46115152,


Note that the 2nd-from-last exonStart is the same as the 3rd-from-last exonEnd (46115092). What does this mean? A single row in knownGene is a single transcript, so what does it mean to have a zero-length intron? There are enough of these that I want to understand what is going on.
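For anyone who wants to find these cases programmatically, here is a minimal sketch in Python. It assumes the exonStarts and exonEnds fields have already been fetched as comma-separated strings (with the trailing comma UCSC includes); the helper names `parse_coords` and `intron_lengths` are hypothetical and only for illustration.

```python
def parse_coords(field):
    """Split a UCSC comma-separated coordinate list, skipping the empty
    item produced by the trailing comma."""
    return [int(x) for x in field.split(",") if x]


def intron_lengths(exon_starts, exon_ends):
    """Return the length of each intron: the start of exon i+1 minus the
    end of exon i (knownGene exon coordinates are 0-based, half-open)."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    return [start - end for end, start in zip(ends, starts[1:])]


# Values copied from the uc002imy.2 row shown above.
exon_starts = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
exon_ends = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")

lengths = intron_lengths(exon_starts, exon_ends)

# 0-based indices of introns with zero length.
zero_length = [i for i, n in enumerate(lengths) if n == 0]
print(zero_length)  # [8]
print(lengths[-1])  # 2
```

For this transcript the sketch reports one zero-length intron (between the 9th and 10th exons), and it also shows that the last intron is only 2 bp long.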

I have asked this question on the UCSC mailing list but have had no answer yet.

ucsc exon splicing bed transcript • 2.1k views

A response on the mailing list explains that it's due to gaps on the query relative to the transcript sequence. I hadn't thought about these issues before now.
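If these abutting blocks are alignment artifacts rather than real splice junctions, one possible way to handle them downstream is to merge adjacent blocks before counting exons. The sketch below is only that: the function `merge_abutting` and its `max_gap` parameter are hypothetical names, and whether merging is appropriate (and at what gap size) is an assumption that depends on the analysis.

```python
def merge_abutting(starts, ends, max_gap=0):
    """Merge consecutive blocks whose gap is at most max_gap bases.

    starts and ends are equal-length lists of integer coordinates,
    sorted by position, as in knownGene exonStarts/exonEnds.
    """
    merged = []
    for start, end in zip(starts, ends):
        if merged and start - merged[-1][1] <= max_gap:
            # The gap to the previous block is small enough: extend it.
            merged[-1][1] = end
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


# The last three blocks of uc002imy.2 from the question.
starts = [46115032, 46115092, 46115124]
ends = [46115092, 46115122, 46115152]

print(merge_abutting(starts, ends))
# [(46115032, 46115122), (46115124, 46115152)]

print(merge_abutting(starts, ends, max_gap=2))
# [(46115032, 46115152)]
```

With the default `max_gap=0` only the blocks that touch are joined; raising it to 2 also absorbs the 2 bp gap before the final block.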

9.9 years ago
Scott Cain ▴ 750

I wonder if there is a CDS boundary there, like a stop codon. Sometimes I've seen data goofs where one exon is split into two when part of it is coding and the other isn't.


Could be... though this occurs even in transcripts that are (annotated as) non-coding.

9.9 years ago
Pi ▴ 520

Could it be intron retention?