Question: List of TSS from GTF file
Roman Hillje (Milan, Italy) wrote, 19 months ago:

Hi,

I'd like to create lists of transcription start sites (TSSs) based on the GTF files of the multiple annotation sets we have in our group. For example, for mm10 we have the GENCODE annotation, the RefSeq annotation, refGene, etc. For each of them, I have a GTF/GFF file. Having a list of TSS regions is particularly helpful for ChIP-seq analysis.

What I tried so far is the following protocol:

  • read the GTF file using readGFF (rtracklayer package)
  • define the TSS as the start position of each entry on the + strand and as the end position of each entry on the - strand
  • get all unique TSSs (per chromosome)
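
The three steps above can be sketched in a few lines. This is a plain-Python illustration rather than the rtracklayer workflow used in the post; the function name `unique_tss`, the optional `feature` argument, and the toy records are assumptions made up for the example. It also shows how taking every entry (genes, transcripts, exons, ...) yields more "TSSs" than restricting to one feature type.

```python
from collections import defaultdict

def unique_tss(gtf_lines, feature=None):
    """Return {chromosome: sorted unique TSS positions} from GTF lines.

    feature=None uses every entry, as in the protocol described above;
    passing e.g. "transcript" keeps only rows with that value in column 3.
    """
    tss = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        if feature is not None and fields[2] != feature:
            continue
        chrom, strand = fields[0], fields[6]
        if strand == "+":
            tss[chrom].add(int(fields[3]))  # start column (1-based)
        elif strand == "-":
            tss[chrom].add(int(fields[4]))  # end column (1-based, inclusive)
    return {chrom: sorted(positions) for chrom, positions in tss.items()}

# Toy records, invented for illustration (not real annotation).
toy_gtf = [
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1";',
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\ttoy\texon\t300\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr2\ttoy\tgene\t50\t80\t.\t-\t.\tgene_id "G2";',
    'chr2\ttoy\ttranscript\t50\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]

all_entries = unique_tss(toy_gtf)                        # {'chr1': [100, 150, 300], 'chr2': [80]}
transcripts_only = unique_tss(toy_gtf, feature="transcript")  # {'chr1': [100, 150], 'chr2': [80]}
```

In the toy data, the exon starting at position 300 is counted as a TSS when every entry is used, which is the kind of inflation that restricting the feature type avoids.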

However, using the GENCODE annotation (very comprehensive), I end up with ~420k TSSs (all) or ~350k TSSs (protein-coding transcripts). That is far too many, considering that there are only ~50k unique genes in the list.

Do you have any recommendations for how to reduce the list? For example, I could take the first/last TSS for each gene, but I don't know what a solid way to proceed would be.

Any suggestion is appreciated. If it is easier, I could also take the TSSs from an online reference, but I thought it would be most consistent to use the same annotation files for all the analyses (including RNA-seq data).

Thanks, Roman

tss gtf • 1.8k views
modified 19 months ago by Alex Reynolds • written 19 months ago by Roman Hillje

Are you just pulling the start position of each entry (each line) in the GTF?

written 19 months ago by geek_y

Hi Alex,

I'm planning to create my own file of TSSs with upstream and downstream regions, using the GENCODE annotation GTF file. I saw your post and would like to know more about how you loaded the GTF file into R and how you defined the TSS regions. Could you please help me with that?

Thanks

written 7 months ago by munaj86
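
On the upstream/downstream regions asked about in the comment above: once a TSS position and strand are known, a strand-aware window can be computed as in the sketch below. The function name `tss_window` and the default sizes (1000 bp upstream, 500 bp downstream) are arbitrary assumptions for illustration, not values from this thread. Coordinates are treated as 1-based and inclusive, as in GTF.

```python
def tss_window(tss, strand, upstream=1000, downstream=500):
    """Return a 1-based inclusive (start, end) window around a TSS.

    "Upstream" follows the direction of transcription, so on the - strand
    it lies at higher coordinates than the TSS.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Clip at the chromosome start; clipping at the chromosome end would
    # need chromosome lengths, which this sketch does not have.
    return max(1, start), end

plus_window = tss_window(5000, "+")    # (4000, 5500)
minus_window = tss_window(5000, "-")   # (4500, 6000)
clipped = tss_window(200, "+")         # (1, 700)
```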
Alex Reynolds (Seattle, WA, USA) wrote, 19 months ago:

First filter the GTF file for genes, so that you generate a list of TSSs based on gene annotations.
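
A minimal sketch of that filtering step, in plain Python. It assumes the file contains records whose third column is `gene` and whose attribute column carries a `gene_id "..."` entry, as GENCODE GTFs do; files without such records would need a different approach. The function name `gene_tss` and the toy records are made up for the example.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_tss(gtf_lines):
    """Return {gene_id: (chromosome, tss, strand)} using only 'gene' records."""
    result = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        match = GENE_ID.search(fields[8])
        strand = fields[6]
        if match is None or strand not in ("+", "-"):
            continue
        # TSS is the start on the + strand and the end on the - strand.
        pos = int(fields[3]) if strand == "+" else int(fields[4])
        result[match.group(1)] = (fields[0], pos, strand)
    return result

# Toy records, invented for illustration (not real annotation).
toy_gene_gtf = [
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1";',
    'chr1\ttoy\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr2\ttoy\tgene\t50\t80\t.\t-\t.\tgene_id "G2";',
]

genes = gene_tss(toy_gene_gtf)
# {'G1': ('chr1', 100, '+'), 'G2': ('chr2', 80, '-')}
```

This gives one TSS per gene, at the cost of ignoring alternative transcript start sites within a gene.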

written 19 months ago by Alex Reynolds

Thanks, I don't know how I missed this obvious step. That brought the list down to a manageable number (~50k).

And taking the start/end for +/- strand, respectively, is the right way to go?

written 19 months ago by Roman Hillje

Yes, though you might also look at CAGE data for experimental confirmation of TSSs.

written 19 months ago by Alex Reynolds