Question

Annotating TSS: By Transcript or by Gene? Code Validation Help Needed!

5

Entering edit mode

11 months ago

Rafael Soler ★ 1.2k

Hey everyone,

I'm currently working on annotating TSS (Transcription Start Site) information, and I have come across a dilemma: should I annotate TSS by transcript or by gene? I would really appreciate your insights and opinions on this matter.

On a related note, I have been trying to implement a code to annotate a TSS bed file from non-canonical species using a GTF file independently of being a NCBI or Ensembl GTF. I would like to verify if my code is correct or if there are any improvements that can be made. Here's the code I have so far:

import argparse

def gtf_to_tss_bed(input_file, output_file):
    tss_coordinates = set()  # Set to store unique TSS coordinates

    with open(input_file, 'r') as gtf, open(output_file, 'w') as bed:
        for line in gtf:
            if not line.startswith('#'):
                fields = line.strip().split('\t')
                if len(fields) >= 8 and fields[2] == 'transcript':
                    chrom = fields[0]
                    start = int(fields[3]) - 1  # Adjust to 0-based index
                    end = int(fields[4])
                    strand = fields[6]
                    transcript_id = fields[8].split('transcript_id "')[1].split('";')[0]  # Extract transcript_id

                    if strand == '+':
                        tss = start
                    else:
                        tss = end - 1

                    coordinates = (chrom, tss, tss + 1)
                    if coordinates in tss_coordinates:
                        continue  # Skip if coordinates already seen

                    tss_coordinates.add(coordinates)  # Add coordinates to set
                    bed_fields = [chrom, str(tss), str(tss + 1), transcript_id, strand]  # BED format: [chrom, start, end, transcript_id, strand]
                    bed.write('\t'.join(bed_fields) + '\n')

# Parse command-line arguments
parser = argparse.ArgumentParser(description='Convert GTF file to TSS BED file')
parser.add_argument('-g', '--gtf', help='Input GTF file', required=True)
parser.add_argument('-o', '--output', help='Output BED file', required=True)
args = parser.parse_args()

# Call the conversion function
gtf_to_tss_bed(args.gtf, args.output)

Thank you in advance,

Rafael

genome bed transcript TSS • 1.6k views

ADD COMMENT • link 7 months ago by Rafael Soler ★ 1.2k

3

Entering edit mode

I've always had this doubt too, IMO probably annotating on transcript (Tx) TSS is the most useful but it also may depend on your biological question and on the data at hand.

For example, in the past I've observed for some ChIP-seq data and also for DNAm data that my signal was functionally associated with Tx TSSs (in the sense that, when exploring the data, there were specific enrichments/depletions for the factor or DNAm at the level of isoform features such as TSS) so to me it made sense to use the Tx level.

Moreover, after annotating at the Tx level, you still have the option of assigning each Tx to their corresponding gene. The only issue I can imagine with annotating many features to the Tx level (TSS, exon, intron...) is that maybe sometimes if there are too many overlapping Txs in a "gene body", in the end your final annotation (chosen maybe with some priority) would not be too informative of that structure. But again, this simplification also occurs when annotating to the gene level. Also, the concept of a "gene" and its delimitations is (to me) more obscure than the concept of a transcript. Variability in annotating to the Tx level may also be more influenced by the quality of the organisms' annotation in the sense of the confidence of all of the transcripts there described and so on.

Just some thoughts... it would be nice if someone can suggest cases when annotating to the gene level is the preferred option!

ADD REPLY • link 11 months ago by Papyrus ★ 2.9k

3

Entering edit mode

I think (generally) most TSSs for transcripts start at the same point within some margin of error. I would worry about overrepresenting some gene TSSs if the only difference between transcripts is alternative splicing.

At the end of the day, as long as it's abundantly clear in the methods section which of the two approaches was taken, and any downstream filtering needed, it shouldn't matter.

ADD REPLY • link 11 months ago by Trivas ★ 1.7k

1

Entering edit mode

I was thinking that the best option is to do it by transcript due to this approach takes into account the alternative promoters and transcription start sites that may be specific to certain transcripts. It provides a more detailed and fine-grained representation of TSS usage within a gene, allowing for a deeper understanding of transcript-specific regulation. However, the TSS enrichment will be higher ¿? because you will be able to annotate more TSSs. Also for ATAC-seq or CHIP-seq studies, the TSS of the gene not always is the correct one for your tissue. It is true that if they share TSS, it is also not informative, but at least more informative than by gene.

ADD REPLY • link 11 months ago by Rafael Soler ★ 1.2k

0

Entering edit mode

7 months ago

Rafael Soler ★ 1.2k

I would like to have this discussion back in case anyone has anything else to contribute.

ADD COMMENT • link 7 months ago by Rafael Soler ★ 1.2k

score 3 · Accepted Answer · 2023-09-16

There is no such thing as a gene TSS - genes don't have transcription start sites, transcripts do. Generally what people mean by "gene TSS" is the TSS of the most 5' transcript associated with that gene (no guarantee this is the most important transcript, or that its even expressed in the sample you are dealing with). Its possible that in some circumstances that is the correct thing to use, but I think it is probably often not.

Three options we've used in the past:

Use all TSSs of transcripts with a TPM>1 in that condition. Its often neccessary to merge the TSSs as that when you have multiple transcripts starting at the same location, you don't count the sequence twice.
Use the TSS of the mostly highly expressed transcript in that condition.
Use the TSS of the most 5' transcript that is expressed with a TPM>1.

All of these of course require expression data.