Question

How to infer the TSS from a gtf file

0

Entering edit mode

2.4 years ago

Bosberg ▴ 50

I have a csv file that I've downloaded from this paper here (abridged version below), and it's referring to CpG locations in the genome from beadChip array data, using assembly hg18 as follows:

CpGmarker   Build   Chr MapInfo SourceVersion   TSS_Coordinate  Gene_Strand Symbol  Synonym Accession   GID
cg00075967  36  15  72282407    36.1    72282245    -   STRA6   PP14296; FLJ12541;  NM_022369.2 GI:21314699
cg00374717  36  17  63814740    36.1    63815191    +   ARSG    KIAA1001;   NM_014960.2 GI:45430056
cg00864867  36  12  78609399    36.1    78608921    -   PAWR    PAR4; Par-4;    NM_002583.2 GI:55769532
...

Let's focus on the first entry for now, STRA6. I downloaded the Gencode gtf files (also for hg18 here), and the corresponding lines look like this (again, abridged for simplicity):

#bin    name            chrom   strand  txStart         txEnd           cdsStart    cdsEnd  exonCount   exonStarts  cdsEndStat  exonFrames  
1136    ENST00000395105 chr15   -   72258859    72282273    72259473    72281661    19  72258859,72260175,72260688,72261554,72261736,   STRA6   cmpl    cmpl
1136    ENST00000323940 chr15   -   72258859    72288424    72259473    72281661    19  72258859,72260175,72260688,72261554,72261736,   STRA6   cmpl    cmpl

The general locations line up pretty closely with the csv above, but am I being naive for thinking that the "TSS" should just be the furthest upstream position in the gtf? i.e. the lowest position within an exon (or the txStart) for genes on the + strand (highest for "-" strand ) ? It seems obvious, but I'm doubting myself because none of these values agree with the value that is provided for the "TSS" in the csv of the publication. I've also tried with RefSeq and a few others --nothing seems to match up precisely.

So am I reading the gtf files correctly to infer the TSS (is it just txStart for "+" strand and txEnd for "-"?), and if so, is there some other obvious reason why the TSS coordinates in the paper's csv file aren't the same?

Thanks for any help you can offer.

gtf regions TSS gene body • 2.1k views

ADD COMMENT • link updated 2.4 years ago by Alex Reynolds 35k • written 2.4 years ago by Bosberg ▴ 50

score 1 · Answer 1 · 2021-11-30

1

Entering edit mode

2.4 years ago

Alex Reynolds 35k

You could extract canonical gene entries from the Gencode file, and then take the start position with respect to strand as the definition of the TSS.

Labeling a canonical transcript could as simple as finding the longest isoform of a gene, or by using APPRIS annotations, where available. Some Gencode downloads will have appris_* tags for this purpose. APPRIS annotations are nice, because they derive support from biological function, instead of genomic length.

Another option is to use TSS derived from CAGE data, such as from refTSS. A quick scan of their raw data downloads suggests they have hg38 data, which you would likely need to liftOver to hg18 space (given that you mention that this is the assembly you are working with). Searching biostars on liftOver will provide suggestions on how this is done.

A difference in assembly may also be a reason the authors' coordinates may differ from your Gencode-derived coordinates, which may not be in hg18 space.

ADD COMMENT • link 2.4 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks for the suggestion --I should have been more clear that both the csv _and_ the gtf were for hg18, so they are using the same assembly (but that's a good thing to check, I edited that in above.)

You could extract canonical gene entries from the Gencode file, and then take the start position with respect to strand as the definition of the TSS.

ok so then I'm not being naive. The TSS really is just the "furthest upstream" position included in the gene body (with the possible complication of multiple isoforms). That's helpful, thanks.

ADD REPLY • link 2.4 years ago by Bosberg ▴ 50

1

Entering edit mode

To be clear, picking the first base is a common definition of convenience, but one that may not always hold uniformly true. Introns may contain alternative TSSs, for instance, and genes can have several alternative promoters:

The transcription of a gene may start from one of several TSSs, a phenomenon known as alternative transcriptional initiation (ATI); the different core promoters used are referred to as alternative promoters [8, 9]. It has been reported that ATI occurs to most eukaryotic protein-coding genes [6, 7, 10–12]. For example, over 50% of all human genes have alternative promoters [13], and on average, a human gene has four TSSs [7].

Ref: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000197

For ad-hoc analyses, using the first base of a canonical isoform to define a proximal promoter or otherwise start with at least one TSS for a given gene is probably fine.

If you're doing anything more complicated, however, particularly if it involves regulation, it may be worth looking at your data more deeply.

ADD REPLY • link 2.4 years ago by Alex Reynolds 35k