As an example, take a single row from the UCSC knownGene table (hg19), selected with this query:
SELECT cdsStart,cdsEnd,K.name,exonStarts,exonEnds FROM knownGene as K, kgXref as X WHERE X.kgId=K.name and K.name='uc002imy.2'
The output (with new-lines added so that exonStarts and exonEnds line up):
cdsStart    cdsEnd      name
46103793    46115139    uc002imy.2

exonStarts  46103534,46105837,46106490,46109521,46110051,46110576,46111228,46114216,46115032,46115092,46115124,
exonEnds    46103841,46105876,46106542,46109599,46110107,46110668,46111310,46114291,46115092,46115122,46115152,
Note that the 2nd-from-last exonStart is the same as the 3rd-from-last exonEnd (46115092). What does this mean? A single row in knownGene is a single transcript, so what does it mean to have a zero-length intron? There are enough of these rows that I want to understand what is going on.
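To make the observation concrete, the gap between each exonEnd and the following exonStart can be computed directly from the two comma-separated fields. Below is a minimal Python sketch for the row above. The helper names parse_coords and intron_lengths are my own and not part of any UCSC tool, and it assumes the usual UCSC convention of 0-based, half-open coordinates, so that start minus previous end is the gap length.

```python
def parse_coords(field):
    """Parse a UCSC comma-separated coordinate list (a trailing comma is allowed)."""
    return [int(x) for x in field.split(",") if x]


def intron_lengths(exon_starts, exon_ends):
    """Return the gap between each exon end and the next exon start."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds must have the same length")
    # Intron i lies between the end of exon i and the start of exon i + 1.
    return [starts[i + 1] - ends[i] for i in range(len(starts) - 1)]


# Values copied from the uc002imy.2 row shown above.
exon_starts = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
exon_ends = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")

lengths = intron_lengths(exon_starts, exon_ends)
# 0-based positions of introns whose length is zero.
zero_length = [i for i, n in enumerate(lengths) if n == 0]

print(lengths)
print(zero_length)
```

For this row the second-to-last gap comes out as zero and the last one is only 2 bases long, so the same transcript has two suspiciously short "introns", not just one.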
I have asked this question on the UCSC mailing list but have had no answer yet.
Update: a response on the mailing list explains that this is due to gaps on the query relative to the transcript sequence. I hadn't thought about these issues before now.