Question: Annotate peaks based on Histone Mark
3
gravatar for victor gaborit
3.8 years ago by
Nantes, France
victor gaborit70 wrote:

Hi everyone,

I've performed ChIP-seq analysis on TFs for a specific cell line for which I only have three histones marks (H3K4me1, H3K27ac and H3K4me3). I used HOMER to annotate differents peak from my datas but now I would like to use these histones marks to improve the annotation (for example if one peak is present as the same region than H3K27ac and H3K4me1 peaks, I would like to say that this region is an active enhancer).

Does somebody have already done this kind of analysis?

Thanks in advance,

chip-seq homer • 2.8k views
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by victor gaborit70
7
gravatar for Sinji
3.8 years ago by
Sinji3.0k
UT Southwestern Medical Center
Sinji3.0k wrote:

It sounds like you're trying to identify 'active enhancers'. There's a couple of interesting options you have to do this. For visualization @Vivek Bhardwaj's answer is great. But i'm assuming that you'd like to have a list of putative 'active enhancers' in which to over-lay your TF ChIP-seq data.

So literature tells us that enhancers (whether active or inactive) are typically marked with H3K4me1, and active enhancers are typically marked with H3K27Ac. What i'm getting from your post is that you aren't aware that there is also literature that has shown H3K4me3 to be associated with active enhancers. So you can see that enhancer definitions are sort of scattered, and a lot of histone modification will be cell line specific.

The first and simplest option is to use bedtools intersect to look for overlapping peaks of H3K4me1 + H3K27Ac. There's a couple of caveats to this, but it's a good starting point. To do this, you have to take into account that enhancers are typically distal sites from genes (though there ARE intragenic enhancers, but my post does not cover them, they're a whole other mess). So your best bet is to head on over to the UCSC Table Browser and download a RefSeq or Gencode annotation for your species of all coding transcripts and then use bedtools slop to extend all transcripts' TSS and TSE +5k in both directions.

bedtools slop -i $REFSEQ -g $GENOME -b 5000 > $OUTPUT

After getting your sloped transcripts, you can then overlay your histone data and identify peaks that contain H3K4me1 + H3K27Ac overlap. You could then argue that these areas are 'probable enhancers'.

bedtools intersect -a $H3K4me1 -b $REFSEQ_SLOPED -v | bedtools intersect -a - -b H3K27Ac > $OUTPUT_ENHANCERS

Your other option is to use PARE to help you identify possible enhancers based on this PVP pattern as indicated in a couple of enhancer reviews. Typically enhancers are marked with large 'peaks' of histone marker followed by a low 'valley' of no signal, and then another 'peak' of histone marker. Think: /\ __ /\

You could supply your H3K4me1 + H3K27Ac files to PARE instead of using two H3K4me1 replicates because you are interested in active enhancers. If you only care about identifying enhancers (poised or active) then you can use PARE as the manual suggests without modifications.

ChromHMM is also something that you could take a look at. It's more complicated than the above, and it requires you to do a bit of reading in order to be know what histone markers are typically found or annotate what regions of the genome. But essentially ChromHMM uses a hidden markov model to identify the presence or absence of each chromatin mark and then discovers combinational 'states' which can be used to annotate a genome.

I hope some of this is useful to you. Let me know if you have any questions.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Sinji3.0k

@Sinji I have a few questions about the first part of your suggestion where you are using bedtools. As per my understanding first you decide the region you want to look at for finding the Enhancer and link them to their target genes. For example in this paper they used 100 KB region up and down stream of TSS.

  1. So, if I use your suggestion I have to get sloped region with: bedtools slop -i $REFSEQ -g $GENOME -b 100000 > $OUTPUT Am I right?
  2. Secondly, you mentioned "The -v parameter essentially means "I want all the H3K4me1 peaks that are NOT in the REFSEQ_SLOPED file" which is what we want" But here I think we "WANT" those region which are within the domain we defined (100kb) and have the H3K4me1 and H3K27ac peaks within up or downstream of the gene's TSS. It will be great help if you can further shed some light on this. Thanks
ADD REPLYlink written 3.5 years ago by Bioinformatist Newbie240

This was by far one of the best examples of finding enhancers! I have been looking for clues of how people are doing it but this is really gold! Thanks a lot, @Sinji for giving these hints! Really appreciate it!

ADD REPLYlink written 3 months ago by sikhtechai30
1
gravatar for vivekbhr
3.8 years ago by
vivekbhr540
Germany
vivekbhr540 wrote:

You can cluster your peaks using the histone marks signals in the region and visualize them using deepTools2... Have a look at computematrix and plotHeatmap/plotProfile tools.. (https://deeptools.readthedocs.org)

ADD COMMENTlink written 3.8 years ago by vivekbhr540
0
gravatar for victor gaborit
3.8 years ago by
Nantes, France
victor gaborit70 wrote:

Thank you for your answer Vivek and Sinji, both of them seems to be useful for me. I didn't have heard about deeptools before which seems to be good to visualization as Sinji said but I will use it.

Looking at your answer Sinji, I choose before to consider H3K4me3 to be associated strongly with promoters for the moment and just focus on H3K4me1 and H3K27ac for enhancers looking. If I succeed that, you're right I'll have to take in account that H3K4me3 also could be associated to active enhancers.

I've just tried your first solution with bedtools which seems to be more efficient than what I've done before.

I will try to run PARE too because identifying poised enhancers should be also interesting for my project.

Concerning ChromHMM, I've already tried to use it and I agree that it require a lot of reading but I'm not confident a lot to build a hidden Markov model with only one cell line and just few histone marks. The result I get from ChromHMM were a bit strange and giving a meaning to different state was really difficult.

Thank you both again for your answers!

ADD COMMENTlink written 3.8 years ago by victor gaborit70

I made a huge error in my coding example by not including the -v parameter in the bedtools intersect function. Please go back and re-do the analysis or else your results are incorrect. My apologies.

The -v parameter essentially means "I want all the H3K4me1 peaks that are NOT in the REFSEQ_SLOPED file" which is what we want.

ADD REPLYlink written 3.8 years ago by Sinji3.0k

Thank you for your post it was very useful for me ( better with the -v option naturally). This method with bedtools seems to be good and more simple compare to ChromHMM especially when you have few histones marks.

Thanks again,

ADD REPLYlink written 3.8 years ago by victor gaborit70

To combat this issue you could look for your cell type in the Encode Project website and use their histone data for your purposes.

ADD REPLYlink written 3.8 years ago by Sinji3.0k

Thank you for your reply but unfortunately, the cell line I want to work on have no data on ENCODE project (special myeloma cell line)

ADD REPLYlink written 3.8 years ago by victor gaborit70

@Victor I have posted a question in response to @Sinji reply to your main question, if you can help me in this context then it will be a great help. Additionally, guide me how you obtained the genome and annotation files. Thanks

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Bioinformatist Newbie240

Hi, To find genome and annotation file, I used, as Sinji reply in my question, you can download files from UCSC to obtain RefSeq file for coding transcript. for the genome file, it is just one tab-delimited file containing those two columns : chromosome name and chromosome length you can find that file also in UCSC for your species. see bedtools for more information

To answer to your questions: Here we choosed to extend TSS and TES positions to +/-5kb assuming that a regulatory region which are closer than 5 kb to a TSS is considered generally more as promoter than enhancers. This distance can change depending on literature (Personally I read that enhancers could be found with a distance of 2kb from TSS or 5kb from TSS).

So in the command that Sinji gave us, it mean that you want all your H3K4me1 peaks that are more than 5Kb distant from TSS (that's why we used the '-v' option).

But if you just want obtain your peak that are in less than 100kb from TSS, yes you are right you don't have to use the '-v' option

If you have other questions, please let me know

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by victor gaborit70

Hi, Thank you for replying. Kindly bear with my naivety, as I am very new to this kind of analysis so I am very short on information about it. It will be great help if you can help me to sort out the following confusions:

  1. To obtain RefSeq Annotation file, I went to UCSC and selected Clade: Mammal, genome: human, assembly: hg19, group: genes and genes prediction, track: refseq genes, region: genome and clicked 'get output', there I selected all the fields. To obtain .genome file, I didn't get how I can obtain it from UCSC so I found and used this one. Now when I ran this command: bedtools slop -i hg19_RefSeq.gtf -g hg19.genome -b 100000 > slop.txt I got the following error: Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 2? Now I am not getting what is wrong with my annotation file.
  2. The purpose of Annotation file is to give information about the known TSS of genes, by using slop we are considering -100kb and +100kb region to further find the peaks overlapping in those regions. Am I right?
  3. If I correctly get the slope file then I will use bedtools intersect -a H3K4me1.broadPeak -b slop.txt | bedtools intersect -a H3K27ac.broadPeak -b slop.txt > Active_Enhancers.txt this command to get a list of Active Enhancers. What this intersect option does is giving me the peaks (K4me1 and K27ac) which falls within 100kb up/down stream of the TSS. Am I right?

Thanks again.

ADD REPLYlink written 3.5 years ago by Bioinformatist Newbie240

there is few points to correct

1 from your RefSeqfile and if you are only interested by TSS :

cat hg19_RefSeq.gtf | grep -v "^#" | awk '{print $3"\t"$5"\t"$5}' > hg19_RefSeq_TSS.bed

your problem is your extension file which are not in expected bed format, with this command you will obtain bed file without header and the exact coordinate of TSS this should solve your first problem and then you'll get a file with +/-100kb from TSS.

2 This method was proposed to create a list of potential active enhancers position we don't mind which gene is it. but yes you will have this information

3 There is an error in your command line :

bedtools intersect -a H3K4me1.broadPeak -b slop.txt | bedtools intersect -a - -b H3K27ac.broadPeak > Active_Enhancers.bed

The first bedtools intersect is to obtain position of H3K4me1 peaks that falls into +/-100kb from one TSS and the second one is to obtain on this peaks, the peaks that have also H3K27ac peaks in this position indicating POTENTIAL Active enhancers (as I told you, you can also obtains promoters with this method if you don't exclude all peaks that falls in less than +/-5kb from TSS)

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by victor gaborit70

Thank you very much Victor, it solved the problem. I think your command for preparing bed file is not right. I think what you wanted to do was to extract 3 columns chr, transcription start position and transcription end position. Am I right? actually that's what I did and it worked out. My K4me1 peak file had 321459 peaks and K27ac peak file had 897959 but the new Active_Enhancer.txt has 267886 peaks. I think these numbers make sense, right? So now I can annotate this Active_Enhancer.txt file with homer but one thing that I am unable to foresee is how can I associate these active Enhancer to their target genes? As I said that I will associate an active enhancer to up/downstream of 100 kb region, how can I make a list of all putative target of every active enhancer peak (the genes which have their TSS within 100 kb up/down stream of for example below 2 Enhancer peaks)

chr8    145701383       145702686       Rank_1  88      .       5.97704 12.09649        8.84716
chr20   33850820        33852880        Rank_2  78      .       4.68799 10.87448        7.89185
ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Bioinformatist Newbie240

For a accurate association of enhancers to genes you'll need to do some Hi-C experimental work. Otherwise typically, you can associate genes to enhancers by taking the closest TSS to the enhancer and calling that it's associated gene.

To do this i'd take a look at the ChIPseeker R package, and look at the annotatePeak function.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Sinji3.0k

Thanks for replying. Experimental data isn't available, I have to do it computationally so the thing I am looking at is given the active Enhancer peak coordinates chr8 145701383 145702686 I get the list of genes whose TSS lies within +/-100kb region. Can I do this thing by ChIPseeker's annotatePeak function ??

ADD REPLYlink written 3.5 years ago by Bioinformatist Newbie240

@Sinji or @Victor: Is it possible to extend the position of peak just in 1 direction. For example, I want to look for active Promoter peaks in 5kb upstream of TSS, so in this case I want to scan the -5kb region (only upstream) of TSS for identifying the active promoter peaks falling in that region. Can you guide me how I can do that?

ADD REPLYlink written 3.5 years ago by Bioinformatist Newbie240

I have tried this and I think it worked fine: bedtools slop -i Promoter_gtf.bed -g hg19.genome -l 5000 -r 0 > slop5k.txt

ADD REPLYlink written 3.5 years ago by Bioinformatist Newbie240

This is fine. I'm assuming you have strand specific information for your genes TSS in which case you will need to add the -s parameter so that your 'upstream' extension is actually correct.

ADD REPLYlink written 3.5 years ago by Sinji3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1719 users visited in the last hour