How exactly can I define promoter region from GENCODE annotation GTF file?
14 months ago

Hi,

I want to extract the promoter of every protein-coding gene in the genome. My colleagues suggested using the GENCODE GTF annotation file.

However, looking at the file contents, I didn't see a "promoter" defined anywhere.

For example, here's the GENCODE (v43) entry for the "TERT" gene: (screenshot of the GTF rows, not reproduced here)

My naive approach would be to take, say, 1000 bp upstream of the "start" coordinate of the "gene" feature (first row) and assume that every distinct transcript of the gene has the same promoter. Is this a sensible thing to do, or am I completely wrong here?

Thank you so much!

promoter protein-coding GENCODE annotation • 2.3k views
ADD COMMENT

You won't find a GTF of "official" human promoter regions. Most genome-wide promoter annotations are inferred from ChIP-seq studies looking at histone modifications and TF binding. There are some databases floating around that people have put together based on ChIP-seq data and/or other data.

To better understand how to advise you, can you elaborate on what you want to do with the promoter regions?

ADD REPLY

Thanks for your insights

Are you familiar with Hi-C? What I want to do is find the genomic regions/fragments that interact with the promoter of every protein-coding gene.

In essence, the result would be similar to what I'd get from Promoter Capture Hi-C (a version of Hi-C that only identifies interactions with gene promoters). Now I wonder: if promoters are inconsistently and incompletely annotated, how do people do Promoter Capture Hi-C at all? I thought it would require some "official" promoter for every gene.

ADD REPLY

Does your Hi-C really have the resolution for promoter-level analysis, which would be close to 1 kb? For starters, you could simply use the 1 kb upstream of every annotated TSS (on the respective strand).
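
A minimal sketch of that suggestion, assuming the TSS is taken from the "gene" rows of a GENCODE-style GTF ("start" on the + strand, "end" on the - strand). The helper names (`promoter_interval`, `gene_promoters`) and the demo rows are made up for illustration and are not real annotation. GTF coordinates are 1-based and inclusive, so the output is converted to 0-based, half-open BED-style intervals. The filter relies on the GENCODE `gene_type` attribute; Ensembl GTFs call it `gene_biotype` instead.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def promoter_interval(start, end, strand, upstream=1000):
    """Return a 0-based, half-open window of `upstream` bp before the TSS.

    `start` and `end` are 1-based inclusive GTF coordinates.
    """
    if strand == "+":
        # TSS is the leftmost base; the window ends just before it.
        return max(0, start - 1 - upstream), start - 1
    if strand == "-":
        # TSS is the rightmost base; "upstream" means higher coordinates.
        # Not clamped to the chromosome length (that needs a sizes file).
        return end, end + upstream
    raise ValueError(f"unsupported strand: {strand!r}")


def gene_promoters(gtf_lines, upstream=1000):
    """Collect (chrom, start, end, name, strand) for protein-coding genes."""
    out = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = fields[8]
        if 'gene_type "protein_coding"' not in attrs:
            continue
        name = GENE_NAME.search(attrs)
        lo, hi = promoter_interval(int(fields[3]), int(fields[4]), fields[6], upstream)
        out.append((fields[0], lo, hi, name.group(1) if name else ".", fields[6]))
    return out


# Made-up demo rows (not real genes or coordinates).
demo_gtf = [
    'chrDemo\tHAVANA\tgene\t5000\t9000\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "FWD";',
    'chrDemo\tHAVANA\tgene\t20000\t26000\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding"; gene_name "REV";',
    'chrDemo\tHAVANA\tgene\t30000\t31000\t.\t+\t.\tgene_id "G3"; gene_type "lncRNA"; gene_name "NC";',
]

promoters = gene_promoters(demo_gtf)
for record in promoters:
    print(*record, sep="\t")
```

For a real file you would pass an open file handle instead of `demo_gtf` and write the tuples out as a BED file to intersect with the Hi-C bins.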

ADD REPLY

The thing is, GENCODE doesn't annotate TSS either.

So by TSS, do you mean just the "start" ("end" for the negative strand) coordinate of the row labelled "gene" in the GTF file (the first row in the picture above)?

And yes, the Hi-C library has a resolution of exactly 1 kb.
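
On the TSS question: besides the "gene" row, the GTF also has one "transcript" row per isoform, and the 5' end of each of those rows can be read as that transcript's TSS. Here is a rough sketch, assuming that interpretation, that collects the distinct TSS positions per gene so you can see whether one promoter per gene is a reasonable simplification. The function name `tss_by_gene` and the demo rows are hypothetical.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def tss_by_gene(gtf_lines):
    """Map gene_id -> sorted list of distinct 1-based transcript TSS positions."""
    tss = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        strand = fields[6]
        match = GENE_ID.search(fields[8])
        if match is None or strand not in ("+", "-"):
            continue
        start, end = int(fields[3]), int(fields[4])
        # 5' end of the transcript: start on +, end on -.
        tss[match.group(1)].add(start if strand == "+" else end)
    return {gene: sorted(positions) for gene, positions in tss.items()}


# Made-up demo rows: three isoforms of one minus-strand gene.
demo_transcripts = [
    'chrDemo\tHAVANA\ttranscript\t20000\t26000\t.\t-\t.\tgene_id "G2"; transcript_id "T1";',
    'chrDemo\tHAVANA\ttranscript\t20000\t25500\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
    'chrDemo\tHAVANA\ttranscript\t21000\t26000\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]

tss_sites = tss_by_gene(demo_transcripts)
print(tss_sites)
```

In this toy case the gene has two distinct TSS positions 500 bp apart, so a single 1 kb window anchored at the gene row would miss part of the upstream region of the shorter isoform.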

ADD REPLY

Just out of interest, how many billion reads are necessary for 1kb?

ADD REPLY

Unfortunately, I'm not the one who made the Hi-C library (it was one of the postdocs in my lab), so I'm not familiar with that.

All I'm getting is clean interaction data.

ADD REPLY
14 months ago
jv ★ 1.8k

From "Promoter capture Hi-C-based identification of recurrent noncoding mutations in colorectal cancer", it looks like they used data from Ensembl for promoter sequences.

The following may also be helpful:

https://useast.ensembl.org/Help/Faq?id=386

ADD COMMENT

Hmmm, thank you for the paper. It was quite insightful to see how different people approach this.

Thanks thanks thanks a lot!!

ADD REPLY
