Question: plotHeatmap for clustering samples
gravatar for jamzaleg84
6 months ago by
jamzaleg8440 wrote:

Hello Everyone,

I've been using deeptools to make heatmaps of ChIP-seq samples and hierarchical clustering of those samples. I have two different transcription factors that I'm interested in and I'm interested in determining if there are regions where each transcription factor binds individually, where there is joint binding, and where in the presence of a mutant transcription factor to we see more binding for the other TF (for example if TF "A" is mutant, in the ChIP-seq of B do we unique regions of binding that we don't see in when the TF A is WT).

So far I made a union files of all peaks by first combining the bam files for all conditions of each transcription factor, then called peaks with MAC2 for each TF, I then used bedtools to take a union of peaks and remove duplicates to come up with a list of about 8,000 sites, in which I used bedtools slop to make sure that the peak regions were all 1kb long, to use as my reference point list for computeMatrix.

computeMatrix reference-point --referencePoint center -R merge_TF1_TF2_slop.bed -S X_mut1_TF1.bigwig X_mut1_TF2.bigwig X_mut2_TF1.bigwig X_mut2_TF2.bigwig X_WT_TF1.bigwig X_WT_TF2.bigwig --skipZero -o matrix_TF1_TF2_unionPeaks.gz

I used "center" as my only other choices were TSS and TES and I wasn't really sure which was the best option. I then plotted the matrix using plotHeatmap with the goal of using k-means or hclust to force it to separate into unique clusters of samples (for example, peaks in sample 1 and 3, but not 2, 4, 5, and 6). The code used for plotHeatmap was

plotHeatmap -m matrix_TF1_TF2_unionPeaks.gz -o matrix_TF1_TF2_unionPeaks.png --outFileSortedRegions matrix_TF1_TF2_cluster.bed --colorMap RdBu --whatToShow 'heatmap and colorbar'  --zMin -2 -2 0 --zMax 2 2 3 --hclust 7

The resultant graph didn't really show any clear clustering of the samples (in the least I anticipated that the TFs would cluster together), but also I didn't really understand the colorbar at the bottom.

As you can see in the heatmap, even though there are clusters, It doesn't seem that the clusters are distinct.

heatmap Does anyone have any suggestions on what exactly I am doing wrong based on my method above. It could very well be that this is just the data and there are no clear clusters, but it seems perplexing based on the motif analysis that I did on these samples suggesting that there are clear overlaps amongst the TF1 and TF2 samples.

Thank you for your time,


tagging: Devon Ryan Devon Ryan

ADD COMMENTlink modified 6 months ago • written 6 months ago by jamzaleg8440

Please post input data, code and expected output or error, to understand the context of the post @ jamzaleg84

ADD REPLYlink written 6 months ago by cpad011214k

Hi there,

Thank you for your response. Do you mean you would like me to post the actual files I used to make the graph? Certainly, but my question was more based on whether or not I made the appropriate reference-point list to do the heatmap-cluster analysis. Or should I have used a different reference-point option like TSS or TES instead of center.

My expected output looks along the lines of this graph from Miranda et al from Cancer Research 2013 Pictured below - notice how the peaks are centered and the clusters are very clear. Again, it could just be my data, but I wanted to make certain I'm doing this correctly.

enter image description here

any suggestions @devonryan?

ADD REPLYlink modified 6 months ago by _r_am31k • written 6 months ago by jamzaleg8440

I'm not sure what your bedSlop command is doing. Since you're using the referencePoint mode, the size of the regions in the BED files won't matter to computeMatrix as it will simply take the bp position in the middle of each region from the BED file, extend it by the number of bp specified via -a and -b (as in "after" and "before" the reference point). In the example plot from Miranda et al. -a and -b should have been both set to 1000.

I agree, currently it looks like the clustering is dominated by the background rather than the binding in the peaks. Did you assess the overlap of the peaks using the BED files and bedtools? Maybe these TFs are completely overlapping? Alternatively, maybe neither of the ChIPs worked and you pulled down generally open TSS or the like.

ADD REPLYlink written 6 months ago by Friederike6.5k

Thank you so much for your reply! I re-did the computeMatrix with the original TF union file (without the Slop) and added 1kb to both ends (picture below). I also looked at all the files (including the inputs) corresponding to all the genes to see if there is any difference between the pull-down and the input, and there are differences though I differ to you to determine if this is the type of difference I should see (picture below). Bedfiles did show clear differences between the TF1 and TF2 pulldowns - for instance motif analysis for TF1 pulldown showed the TF1-response element as the most enriched motif and this also was the case for TF2, but the bed and bigwig files reveals that the data is a little noisy (For instance when I used bedtools intersect to determine unique regions of binding, I found that even in regions where there is a difference, if you looked at the entire gene there would be similar level of binding. All in all, I don't know what to make of it . Do you have any recommendations on how I can determine if the ChIP worked (other than what I already did), because it's a little inconclusive. enter image description here enter image description here

ADD REPLYlink written 6 months ago by jamzaleg8440

I find the colormap rather confusing; intuitively, I want to see bound regions (= positive signal) in red and background in blue. I recommend you either go with the default one or choose a different divergent palette (check here).

Your question about whether the ChIP has worked is, in fact, different from the original question of your post. However, I recommend reading the deepTools notes on this as well. Generally, you can use similar QC metrics for TF-ChIP-seq samples as one would for ATAC-seq, i.e.:

  • number of peaks (usually in the thousands to ten-thousands)
  • distinction between signal and noise/background, can be assessed e.g. via the fingerprint plot.
  • manual inspection of the normalized bigWig files in a Genome Browser, e.g. IGV is also invaluable in judging the results of a ChIP-seq experiment. Ultimately, nothing beats replicates, though.

You can also read more about ChIP-seq QC in the CHANCE paper by Diaz et al. and the corresponding manual; Mendoza-Parra et al. have an interesting approach, too

ADD REPLYlink written 6 months ago by Friederike6.5k

Thank you so much! Yes, sorry I probably should have asked this as a separate question, but it was a natural follow-up question to the point you made about maybe the pull-downs not working. Thank you so much for all your help!


ADD REPLYlink written 6 months ago by jamzaleg8440

Sure thing. Good luck!

ADD REPLYlink written 6 months ago by Friederike6.5k

Hello! I decided to narrow down the list to one mutant and one WT with Pulldowns for TF1 and TF2. As you suggested I made the interval 1kb upstream and downstream the unique called peak site for all conditions. When I call for clustering in the plotHeatmap command it makes nice clusters, but it clearly looks like there is region that is not made into its own cluster that I'm actually interested in (highlighted in red - present nicely in the first two samples and not so much in the 3rd and 4th sample), is there a better way to do the clustering analysis so it could stratify the regions to clusters whereby the clusters show shared binding between some of the samples and not others (For example, found in samples 1, 2, and 3, and not in 4), it seems like when I add the kmeans or hclust option it doesn't make those types of clusters. Here is the command I ran:

plotHeatmap -m $VO -o "${file_name}".png --outFileSortedRegions "${file_name}"_outfile.bed --colorMap Purples --whatToShow 'heatmap and colorbar'  --zMin -4 --zMax 4 --kmeans 8

Here is the plot with the highlighted region of interest enter image description here

ADD REPLYlink written 6 months ago by jamzaleg8440

You "just" need to play around with the number of clusters, either via the kmeans or hclust option. Increasing the region size may also help with focusing the signal visually.

ADD REPLYlink written 6 months ago by Friederike6.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 991 users visited in the last hour