List of TSS from GTF file
1
0
3.5 years ago
Roman Hillje ▴ 40

Hi,

I'd like to create lists of transcription start sites (TSSs) from the GTF files of the several annotation sets we use in our group. For example, for mm10 we have GENCODE, RefSeq, refGene, etc. For each of them, I have a GTF/GFF file. A list of TSS regions is particularly helpful for ChIP-seq analysis.

What I tried so far is the following protocol:

• read the GTF file using readGFF (rtracklayer package)
• define the TSS as the start position of each entry on the + strand and as the end position of each entry on the - strand
• keep all unique TSSs (per chromosome)
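
The protocol above can be sketched outside R as well. The following is a minimal Python illustration, not the rtracklayer workflow itself. It assumes a standard tab-separated GTF with 1-based inclusive coordinates; the function name `unique_tss`, its `feature` parameter and the example lines are made up for illustration. It also keeps the strand in each result, which goes slightly beyond "unique per chromosome".

```python
def unique_tss(gtf_lines, feature="transcript"):
    """Collect unique (chrom, tss, strand) tuples for one feature type.

    Assumes standard 9-column, tab-separated GTF lines with 1-based,
    inclusive coordinates. The TSS is the start on "+" and the end on "-".
    """
    sites = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        chrom, _source, ftype, start, end, _score, strand = fields[:7]
        if ftype != feature or strand not in ("+", "-"):
            continue
        tss = int(start) if strand == "+" else int(end)
        sites.add((chrom, tss, strand))
    return sorted(sites)


# Hypothetical toy records: three transcripts of one "+" gene, one "-" transcript.
example_lines = [
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tHAVANA\ttranscript\t100\t450\t.\t+\t.\tgene_id "G1"; transcript_id "T3";',
    'chr2\tHAVANA\ttranscript\t200\t900\t.\t-\t.\tgene_id "G2"; transcript_id "T4";',
]

tss_list = unique_tss(example_lines)
```

In this toy input, T1 and T3 share a start and collapse into one TSS, while T2 contributes a second TSS for the same gene. That is why transcript-level lists grow much larger than the number of genes.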

However, using the GENCODE annotation (very comprehensive) I end up with ~420k TSSs (all transcripts) or ~350k TSSs (protein-coding transcripts). That is far too many, considering that there are only ~50k unique genes in the list.

Do you have any recommendation for how to reduce the list? For example, I could take the first/last TSS for each gene, but I don't know what is a solid way to proceed here.

Any suggestion is appreciated. If it is easier, I could also get the TSSs from an online reference, but I thought it would be most consistent to use the same annotation files for all analyses (including the RNA-seq data).

Thanks, Roman

TSS GTF • 4.8k views
1

Are you just pulling the start position of each entry (each line) in the GTF?

0

Hi Alex,

I'm planning to create my own file of TSSs with upstream and downstream regions using the GENCODE annotation GTF file. I saw your post and would like to know more about how you loaded the GTF file into R, how you defined the TSS regions, and so on. Could you please help me with that?

Thanks

2
3.5 years ago

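First filter the GTF file for gene entries (rows whose feature type, the third column, is "gene"), so that you generate one TSS per gene from the gene annotations rather than one per transcript or exon.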
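
As a rough illustration of this filtering step, here is a minimal Python sketch. It assumes the GTF contains rows with feature type "gene" and a `gene_id "..."` attribute, as GENCODE files do; some GTFs from other sources may lack gene rows. The function name `gene_tss` and the toy records are hypothetical.

```python
import re

# Assumes GTF-style attributes such as: gene_id "ENSMUSG..."; gene_name "...";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_tss(gtf_lines):
    """Map gene_id -> (chrom, tss, strand), using only rows of type "gene".

    The TSS is the start coordinate on "+" and the end coordinate on "-"
    (GTF coordinates are 1-based and inclusive).
    """
    tss_by_gene = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        match = GENE_ID.search(fields[8])
        if match is None or strand not in ("+", "-"):
            continue
        tss_by_gene[match.group(1)] = (chrom, start if strand == "+" else end, strand)
    return tss_by_gene


# Hypothetical toy records: two genes plus one transcript row that is ignored.
toy_gtf = [
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr2\tHAVANA\tgene\t200\t900\t.\t-\t.\tgene_id "G2"; gene_type "lncRNA";',
]

gene_sites = gene_tss(toy_gtf)
```

Because each gene appears once as a "gene" row, the result has one entry per gene, which matches the drop from hundreds of thousands of TSSs to roughly the number of genes.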

0

Thanks, I don't know how I missed this obvious step. That brought the list down to a manageable number (~50k).

And taking the start/end for +/- strand, respectively, is the right way to go?

0

Yes, though you might also look at CAGE data for experimental confirmation of TSSs.