I'd like to create lists of transcription start sites based on the GTF files of the multiple annotation sets we have in our group. For example, for mm10 we have GENCODE annotation, RefSeq annotation, refGene etc. For each of them, I have a GTF/GFF file. Having a list of TSS regions is particularly helpful when doing ChIPseq analysis.
What I tried so far is the following protocol:
- reading the GTF file using
- define the TSS as the start position for each entry that is on the + strand and as the end position of every entry that is on the - strand
- get all unique TSS (per chromosome)
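
To make the steps above concrete, here is a minimal sketch in plain Python. It is not the exact code I ran: the function name `tss_from_gtf_lines` and the toy records are made up for illustration, and I am assuming that only `transcript` features are used and that the file follows the usual 9-column, tab-separated GTF layout with 1-based inclusive coordinates.

```python
from collections import defaultdict


def tss_from_gtf_lines(lines, feature="transcript"):
    """Collect unique TSS positions per chromosome from GTF lines.

    Assumption: only rows whose feature column equals `feature` are used.
    """
    tss = defaultdict(set)
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if strand == "+":
            tss[chrom].add((start, strand))  # TSS is the start on the + strand
        elif strand == "-":
            tss[chrom].add((end, strand))    # TSS is the end on the - strand
    # A set per chromosome removes duplicates; sort for stable output.
    return {chrom: sorted(sites) for chrom, sites in tss.items()}


# Toy records (hypothetical, not taken from a real annotation).
example = [
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\ttranscript\t100\t800\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tHAVANA\ttranscript\t900\t1500\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(tss_from_gtf_lines(example))
```

In this toy input the two + strand transcripts share the same start, so they collapse into a single TSS, and the exon row is ignored.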
However, using the (very comprehensive) GENCODE annotation, I end up with ~420k TSS (all transcripts) or ~350k TSS (protein-coding transcripts only). This seems like too many, considering that there are only ~50k unique genes in the list.
Do you have any recommendation for how to reduce the list? For example, I could take the first/last TSS for each gene, but I don't know what is a solid way to proceed here.
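
For reference, the "first/last TSS for each gene" option could look roughly like the sketch below: per gene, keep the most upstream TSS (smallest start on the + strand, largest end on the - strand). The function name `most_upstream_tss_per_gene` and the toy records are hypothetical, and I am assuming GTF-style attributes of the form `gene_id "..."` (GFF3 files use a different attribute syntax). I am not claiming this is the right criterion, only showing what I mean.

```python
import re

# Assumption: GTF-style attribute column, e.g. gene_id "G1"; transcript_id "T1";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def most_upstream_tss_per_gene(lines, feature="transcript"):
    """Return {gene_id: (chrom, tss, strand)} keeping the most upstream TSS."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene, chrom, strand = match.group(1), fields[0], fields[6]
        if strand == "+":
            pos = int(fields[3])
        elif strand == "-":
            pos = int(fields[4])
        else:
            continue  # unstranded entries have no defined TSS here
        current = best.get(gene)
        # "Upstream" means a smaller coordinate on +, a larger one on -.
        if (
            current is None
            or (strand == "+" and pos < current[1])
            or (strand == "-" and pos > current[1])
        ):
            best[gene] = (chrom, pos, strand)
    return best


# Toy records (hypothetical).
records = [
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\ttranscript\t100\t800\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tHAVANA\ttranscript\t900\t1400\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
    'chr1\tHAVANA\ttranscript\t950\t1500\t.\t-\t.\tgene_id "G2"; transcript_id "T4";',
]
print(most_upstream_tss_per_gene(records))
```

This yields exactly one TSS per gene, at the cost of discarding alternative promoters, which is the trade-off I am unsure about.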
Any suggestions are appreciated. If it is easier, I could also get the TSS from an online reference, but I thought it would be most consistent to use the same annotation files for all analyses (including RNAseq data).