I have a csv file that I downloaded from this paper here (abridged version below). It lists CpG locations in the genome from BeadChip array data, using assembly hg18, as follows:
CpGmarker   Build  Chr  MapInfo   SourceVersion  TSS_Coordinate  Gene_Strand  Symbol  Synonym             Accession    GID
cg00075967  36     15   72282407  36.1           72282245        -            STRA6   PP14296; FLJ12541;  NM_022369.2  GI:21314699
cg00374717  36     17   63814740  36.1           63815191        +            ARSG    KIAA1001;           NM_014960.2  GI:45430056
cg00864867  36     12   78609399  36.1           78608921        -            PAWR    PAR4; Par-4;        NM_002583.2  GI:55769532
...
Let's focus on the first entry for now, STRA6. I downloaded the Gencode gtf files (also for hg18 here), and the corresponding lines look like this (again, abridged for simplicity):
#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts cdsEndStat exonFrames
1136 ENST00000395105 chr15 - 72258859 72282273 72259473 72281661 19 72258859,72260175,72260688,72261554,72261736, STRA6 cmpl cmpl
1136 ENST00000323940 chr15 - 72258859 72288424 72259473 72281661 19 72258859,72260175,72260688,72261554,72261736, STRA6 cmpl cmpl
The general locations line up pretty closely with the csv above, but am I being naive in thinking that the "TSS" should just be the furthest upstream position in the gtf, i.e. the lowest position within an exon (or the txStart) for genes on the "+" strand, and the highest position (or the txEnd) for genes on the "-" strand? It seems obvious, but I'm doubting myself because none of these values agree with the value given for the "TSS" in the publication's csv. I've also tried RefSeq and a few other annotations, and nothing matches up precisely.
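To make my reading concrete, here is a minimal Python sketch of the rule I'm assuming. The helper name `infer_tss` and the `zero_based_start` flag are my own, not from any library. The +1 shift on the "+" strand assumes the table uses UCSC-style 0-based start coordinates, which I'm guessing from the column names (`#bin`, `txStart`, `txEnd`) but haven't confirmed for this file.

```python
def infer_tss(strand, tx_start, tx_end, zero_based_start=True):
    """Return the 1-based TSS implied by a transcript's span and strand.

    Assumption: starts are 0-based and ends are 1-based (UCSC-style
    half-open intervals). Set zero_based_start=False for a file whose
    starts are already 1-based.
    """
    if strand == "+":
        # On the "+" strand, transcription begins at the lowest coordinate.
        return tx_start + 1 if zero_based_start else tx_start
    if strand == "-":
        # On the "-" strand, transcription begins at the highest coordinate.
        return tx_end
    raise ValueError(f"unexpected strand: {strand!r}")


# The two STRA6 transcripts shown above (both on the "-" strand).
stra6 = [
    ("ENST00000395105", "-", 72258859, 72282273),
    ("ENST00000323940", "-", 72258859, 72288424),
]

stra6_tss = {
    name: infer_tss(strand, start, end)
    for name, strand, start, end in stra6
}

for name, tss in stra6_tss.items():
    print(name, tss)
```

Under this rule the two transcripts give different TSS positions (72282273 and 72288424), so a per-gene "TSS" would also depend on which transcript is chosen.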
So am I reading the gtf files correctly to infer the TSS (is it just txStart for "+" strand and txEnd for "-"?), and if so, is there some other obvious reason why the TSS coordinates in the paper's csv file aren't the same?
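To put a number on the mismatch for STRA6, this is a small sketch of the comparison I'm doing. It takes the txEnd of each transcript as the candidate TSS (the "-" strand reading described above) and measures how far each one is from the paper's `TSS_Coordinate`. The function name `upstream_offset` and the sign convention are my own choices for illustration.

```python
# TSS_Coordinate reported for STRA6 in the paper's csv.
reported_tss = 72282245
strand = "-"

# Candidate TSS per transcript: txEnd, since STRA6 is on the "-" strand.
candidate_tss = {
    "ENST00000395105": 72282273,
    "ENST00000323940": 72288424,
}


def upstream_offset(candidate, reported, strand):
    """Signed distance in bp; positive means `candidate` is upstream of `reported`.

    On the "-" strand, upstream corresponds to higher coordinates.
    """
    diff = candidate - reported
    return diff if strand == "-" else -diff


offsets = {
    name: upstream_offset(pos, reported_tss, strand)
    for name, pos in candidate_tss.items()
}

for name, offset in offsets.items():
    print(f"{name}: {offset:+d} bp relative to the reported TSS")
```

Both candidates come out upstream of the paper's value, by 28 bp and 6179 bp respectively, so neither transcript end reproduces the reported coordinate exactly.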
Thanks for any help you can offer.