Exon Range Vs. exonStarts In UCSC
7.7 years ago
Max ▴ 140

This is a revised version of an earlier query that may not have been stated very clearly:

I have noticed a mismatch between the coordinates given by exonStarts / exonEnds and the exon range in the UCSC Genome Browser's annotation of the hg19 human reference genome.

Specifically, the exonStarts and exonEnds coordinates in the table do not match the exon range reported when the sequences are retrieved. Typically, the exonStarts coordinate is 1 nucleotide before the start of the range, as in the example below:

name         chromosome    strand    exonStarts    exonEnds     exonFrame
NM_030806    chr1          +         184559872     184559949    1

while the sequence returned for the range is:
GAAAAAAGTGCCAGCTCAAATGTAAGACTTAAAACTAATAAAGAGGTTCCGGGATTAGTTCATCAACCCAGAGCAAA


Usually the mismatch between exonStarts and range is +1 nucleotide, but sometimes it is more than this. What is the reason for the discrepancy between range and exonStarts/exonEnds, and which number is the actual coordinate of the first nucleotide in the exon?

7.7 years ago
brentp 23k

The first format uses a 0-based start (https://genome.ucsc.edu/FAQ/FAQformat.html#format9), while the FASTA header shows a 1-based start. A 0-based start means that the first base of the entire chromosome is numbered 0; a 1-based start means that it is numbered 1.

The end coordinate remains the same because, in the 0-based system, the end is non-inclusive (the interval doesn't include the end position) (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). The 0-based exclusive end is therefore numerically equal to the 1-based inclusive end, so only the start shifts by 1.

This is because 0-based is nice for programmers and computers and 1-based is nice for "normal people".
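The conversion described above can be sketched in a few lines of Python. This is only an illustration using the exon from the question; the function names `ucsc_to_one_based` and `one_based_to_ucsc` are made up for this example and are not part of any UCSC tool.

```python
def ucsc_to_one_based(start0, end0):
    """Convert a 0-based, end-exclusive interval (UCSC table style)
    to a 1-based, end-inclusive interval (FASTA header style)."""
    # Only the start shifts; the exclusive end equals the inclusive end.
    return start0 + 1, end0


def one_based_to_ucsc(start1, end1):
    """Inverse conversion: 1-based inclusive -> 0-based, end-exclusive."""
    return start1 - 1, end1


# Exon from the question (NM_030806, chr1, + strand).
exon_start, exon_end = 184559872, 184559949
seq = ("GAAAAAAGTGCCAGCTCAAATGTAAGACTTAAAACTAATAAAGAGGTTCC"
       "GGGATTAGTTCATCAACCCAGAGCAAA")

start1, end1 = ucsc_to_one_based(exon_start, exon_end)

# Length is end - start in the 0-based system,
# and end - start + 1 in the 1-based system.
length_half_open = exon_end - exon_start
length_closed = end1 - start1 + 1

print(f"chr1:{start1}-{end1}")
print(length_half_open, length_closed, len(seq))
```

Both length calculations agree with the length of the sequence pasted in the question, which is a quick way to check which convention a given pair of coordinates uses.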

Thanks.

However, if that were the case, wouldn't both the exonStarts and exonEnds be -1 with respect to the FASTA coordinates? Instead, the exonStarts value is -1 with respect to the FASTA coordinates, while the exonEnds value matches.

Also, which of the coordinate systems (0-based or 1-based) is consistent with Ensembl coordinates?

Thanks for the update. I assume that the coordinates in the Ensembl annotation start at 1?