Question: Genes with promoter and enhancer regions as GTF
hkarakurt50 wrote:

Hello, I am doing a ChIP-Seq analysis and I have sorted SAM files (I used MACS). I ran featureCounts with a GTF file from the UCSC Table Browser, choosing "Genes and Gene Predictions" as the group, "UCSC Genes" as the track, and "knownGene" as the table. I now have a count matrix for my ChIP-Seq experiments, which I analyzed with DESeq2. As far as I know from posts on the Bioconductor forums, this approach is commonly used.

The problem is that my GTF file contains CDS, mRNA, start codon, and stop codon features. As far as I know, promoter regions mostly lie within about 1000 bp upstream of the gene start, and I am not sure whether my GTF file includes these regions. I also have no idea about enhancer regions.

How can I download a GTF file that also has promoter and enhancer regions? Please, I am in a hurry and need help.

Thank you.

venu6.2k wrote:

Okay, first things first

"Please I am in hurry and need help"

Slow down and think again what you are doing.

"I used MACS"

Used MACS for what? (BTW, convert SAM to BAM and save some space. Just a suggestion). I'm sure you wanted to find out ChIP enriched regions, which you can do with MACS.

"I used featureCounts and GTF file from UCSC Table Browser"

To find out what? I assume what you wanted to do here is count reads that fall in MACS-identified peak regions. You don't need a GTF for this. All you need is the peak regions in BED format and a read-counting tool such as multiBamSummary from deepTools (there are others as well, e.g. bedtools).
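To make the idea concrete, here is a minimal Python sketch of counting reads per peak region. It is a toy illustration, not how deepTools or bedtools are implemented: the function name `count_reads_in_peaks` and the example coordinates are made up, reads are reduced to their 0-based start positions, and peaks are assumed to be BED-style (0-based, half-open) intervals. For real BAM files, use one of the established tools.

```python
from bisect import bisect_left
from collections import defaultdict


def count_reads_in_peaks(peaks, read_starts):
    """Count reads whose start position falls inside each peak.

    peaks: list of (chrom, start, end) tuples, BED-style (0-based, half-open).
    read_starts: list of (chrom, position) tuples, 0-based.
    Returns a dict mapping each peak tuple to its read count.
    """
    # Group read positions by chromosome and sort them for binary search.
    by_chrom = defaultdict(list)
    for chrom, pos in read_starts:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    counts = {}
    for chrom, start, end in peaks:
        positions = by_chrom.get(chrom, [])
        # Positions with start <= pos < end lie inside the half-open interval.
        inside = bisect_left(positions, end) - bisect_left(positions, start)
        counts[(chrom, start, end)] = inside
    return counts


# Made-up toy data, for illustration only.
example_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 0, 50)]
example_reads = [("chr1", 100), ("chr1", 150), ("chr1", 200), ("chr1", 599), ("chr2", 10)]
example_counts = count_reads_in_peaks(example_peaks, example_reads)
```

Running this per sample and collecting the counts column by column gives the kind of peak-by-sample count matrix that DESeq2 expects.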

"I have Counts matrix for my ChIP-Seq experiments and I used DESeq2"

Count matrix for how many samples? And what did you find from DESeq2?

I suppose you want to find differentially active promoters from the ChIP-seq data(?). In that case, download the GENCODE annotation file, derive gene-level promoter regions from it (or transcript-level ones, depending on your requirements), use those regions to generate a count matrix, and apply DESeq2. You will get differentially active promoters for your conditions.
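A minimal sketch of the promoter step in Python, assuming a GENCODE-style GTF that contains `gene` records with a `gene_id` attribute. The function name `promoters_from_gtf` and the default window of 1000 bp on each side of the TSS are illustrative assumptions, not a standard, and the example records are invented. The output uses BED coordinates (0-based, half-open) and is not clipped at chromosome ends.

```python
import re


def promoters_from_gtf(gtf_lines, upstream=1000, downstream=1000):
    """Yield (chrom, start, end, gene_id, strand) promoter intervals.

    GTF input is 1-based and inclusive; output is BED-style (0-based, half-open).
    Each window covers `upstream` bp before the TSS, the TSS base itself,
    and `downstream` bp after it, relative to the gene's strand.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records are used in this sketch
        chrom, strand = fields[0], fields[6]
        gene_start, gene_end = int(fields[3]), int(fields[4])
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        gene_id = match.group(1) if match else "."

        # The TSS is the gene start on "+" and the gene end on "-".
        tss0 = (gene_start if strand == "+" else gene_end) - 1  # 0-based
        if strand == "+":
            start, end = tss0 - upstream, tss0 + downstream + 1
        else:
            start, end = tss0 - downstream, tss0 + upstream + 1
        yield (chrom, max(0, start), end, gene_id, strand)


# Invented records, for illustration only.
example_gtf = [
    'chr1\tHAVANA\tgene\t5000\t8000\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    'chr1\tHAVANA\texon\t5000\t5200\t.\t+\t.\tgene_id "G1";',
    'chr1\tHAVANA\tgene\t500\t3000\t.\t-\t.\tgene_id "G2";',
    'chr2\tHAVANA\tgene\t200\t900\t.\t+\t.\tgene_id "G3";',
]
example_promoters = list(promoters_from_gtf(example_gtf))
```

Changing `upstream` and `downstream` lets you try narrower or wider promoter definitions without touching the rest of the workflow.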

"How can I download a GTF file which also have promoter and enhancer regions"

If the purpose of this GTF is to count reads falling in those regions, you don't need a GTF; simple BED files are enough.
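If you want to keep using featureCounts with peak regions, note that it reads GTF or SAF annotation rather than BED, so the BED intervals need converting first. A rough Python sketch of a BED-to-SAF conversion follows. The helper name `bed_to_saf`, the fallback peak names, and the `.` placeholder strand are assumptions for illustration; the placeholder strand is only appropriate for unstranded counting.

```python
def bed_to_saf(bed_lines, default_strand="."):
    """Convert BED lines (0-based, half-open) to SAF rows (1-based, inclusive).

    Returns a list of tab-separated strings, starting with the SAF header.
    """
    rows = ["GeneID\tChr\tStart\tEnd\tStrand"]
    n = 0
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        n += 1
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Use the BED name column if present, else a generated identifier.
        name = fields[3] if len(fields) > 3 else f"peak_{n}"
        has_strand = len(fields) > 5 and fields[5] in ("+", "-")
        strand = fields[5] if has_strand else default_strand
        # Only the start shifts by one; the half-open BED end equals the inclusive SAF end.
        rows.append(f"{name}\t{chrom}\t{start + 1}\t{end}\t{strand}")
    return rows


# Invented intervals, for illustration only.
example_bed = ["track name=peaks", "chr1\t99\t200\tpeak_a\t0\t.", "chr2\t0\t50"]
example_saf = bed_to_saf(example_bed)
```

The rows can be written to a file and passed to featureCounts with `-F SAF`.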

Friederike4.2k wrote:

First of all: this is not as trivial as you might hope. This issue warrants some thought and some trial and error on your part. Being in a hurry now doesn't mean you shouldn't revisit the choices you are making after reading my response.

Enhancers are highly cell-type (and possibly condition) specific, and there isn't even a consensus on how to define enhancers in a uniform manner. Thus, I'm not aware of a GTF file from the usual genome data repositories that contains all enhancers ever defined.

You can browse the data in the UCSC Table Browser: just choose "group: Regulation" instead of "Genes and Gene Predictions". You could then, for example, choose the DNase track as a proxy for open chromatin (enhancers tend to be open, as do active promoters). You would have to pay attention to the cell type, though, as K562 cells might not be what you want.

Promoters are even less well defined: depending on the model organism and the preference of the PI, they may be any region from 200 to 10,000 bp up- and/or downstream of the TSS. Here's a classic Biostars post on how you could do that yourself given a BED file.

Edit: Yes to everything Venu wrote!


@Friederike,

Regarding the second paragraph: if it's really about publicly available "enhancers", one can look at the FANTOM database (or simply download the supplementary table from this publication: A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples. They pre-processed the list very well.)

Regarding the fourth paragraph, I couldn't agree more. I have seen publications defining promoters as narrowly as 100 bp upstream and 50 bp downstream of the TSS, and as broadly as 2000 bp up- and downstream. But if the analysis includes quantification of differential activity, I think it is better to stick with at least 1000 bp up- and downstream.

(reply written 12 months ago by venu)

Thanks for the info about FANTOM!

(reply written 12 months ago by Friederike)
hkarakurt50 wrote:

Thank you for the answers. I will explain more clearly and give my reasons. I used MACS for peak calling. I also have BAM files; I just forgot to mention that. I used featureCounts to quantify my peaks, with a GTF file as the annotation. I have 7 samples for H3K4me1 and 6 samples for H3K27ac ChIP-Seq experiments.

I did this because I need differentially enriched regions and the names of these regions to compare and integrate (or at least try to) with RNA-Seq data from the same experiments. On the NCBI GEO page of the data set, I saw that the researchers used featureCounts and edgeR for differential binding analysis. It is here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2889136

Also, https://support.bioconductor.org/p/109154/#109742 is my post on the Bioconductor forums, where people said DESeq2 can be used. I just want to compare the peaks between conditions.

I used the term "GTF" many times, but I should have said "annotation" instead.

Thank you again.


"differentially enriched regions and the names of these regions to compare and integrate (or at least try it) with RNA-Seq data of same experiments"

Using MACS peaks is probably the first step. I would argue that before you start searching for semi-random lists of enhancers, you should think about how you want to integrate those different data types. For example, how will you associate a given enhancer with the gene it's supposed to regulate? Maybe you can use the RNA-seq data to find your enhancers (e.g., intergenic or intronic regions with signs of transcription, although that's not a trivial analysis, as you need to control for possible genomic DNA contamination). To me it sounds like you need to think a little more about the biological background of the analyses you're about to do, and then go back and think about the types of region annotations you might want to use.
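One common but crude way to link an enhancer to a gene is to pick the gene with the nearest TSS. The Python sketch below illustrates that heuristic only; it is an assumption swapped in for illustration, not a method proposed in this thread, and the nearest gene is often not the true target. The function name `assign_nearest_gene` and the example coordinates are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def assign_nearest_gene(enhancers, tss_list):
    """Assign each enhancer to the gene whose TSS is closest to its midpoint.

    enhancers: list of (chrom, start, end), BED-style (0-based, half-open).
    tss_list: list of (chrom, position, gene_name), 0-based positions.
    Returns a list of (enhancer, gene_name, distance); gene_name and distance
    are None when the chromosome has no annotated TSS.
    """
    # Sort TSS entries per chromosome so the nearest one can be found by bisection.
    by_chrom = defaultdict(list)
    for chrom, pos, gene in tss_list:
        by_chrom[chrom].append((pos, gene))
    for entries in by_chrom.values():
        entries.sort()
    positions = {chrom: [pos for pos, _ in entries] for chrom, entries in by_chrom.items()}

    assignments = []
    for chrom, start, end in enhancers:
        entries = by_chrom.get(chrom)
        if not entries:
            assignments.append(((chrom, start, end), None, None))
            continue
        midpoint = (start + end) // 2
        i = bisect_left(positions[chrom], midpoint)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = entries[max(0, i - 1): i + 1]
        pos, gene = min(candidates, key=lambda entry: abs(entry[0] - midpoint))
        assignments.append(((chrom, start, end), gene, abs(pos - midpoint)))
    return assignments


# Invented coordinates, for illustration only.
example_tss = [("chr1", 1000, "geneA"), ("chr1", 5000, "geneB"), ("chr2", 300, "geneC")]
example_enhancers = [("chr1", 1800, 2200), ("chr1", 4400, 4600), ("chr1", 6000, 6200), ("chr3", 10, 20)]
example_links = assign_nearest_gene(example_enhancers, example_tss)
```

The reported distance is worth keeping: a cutoff on it, or a comparison against expression changes in the RNA-seq data, is one way to sanity-check the assignments.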

(reply written 12 months ago by Friederike)