Question: Why are some tumors grouped with the controls?
4 weeks ago by
Pin.Bioinf230 wrote:

Hello, I am looking at this heatmap and I do not understand why some of the tumours are grouped with the controls. It looks as if the heatmap is 'moved to the left'. Top of the heatmap: [image]

(It is a long heatmap, so I have uploaded just a section.)

Bottom of the heatmap: [image]

I checked the phenotype of the samples but found no correlation between phenotype and this grouping.

There seem to be two tumour subgroups: the largest one, and a small group of tumours that seem to be grouped with the controls. I did a t-test on the mean beta values for all the CpGs between these two groups, and the difference between the means of the two tumour groups is significant. I am afraid that when we publish a heatmap similar to this one, we will have trouble explaining this phenomenon. Any ideas or opinions on this? Thank you!

methylation cpgs • 163 views
modified 4 weeks ago by Kevin Blighe37k • written 4 weeks ago by Pin.Bioinf230
4 weeks ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:

There can be many reasons, some not really relating to bioinformatics:

  1. these tumours genuinely exhibit the 'normal-like' methylation profile over these probes
  2. these tumours have normal cell contamination

Some informatics reasons:

  • your coding is incorrect and you have incorrectly assigned a normal sample as a tumour
  • you have scaled your data incorrectly
  • you should consider the distance and linkage metric that you're using

By the way, for methylation, you could probably also perform the Wilcoxon Signed Rank test on the matched T-N pairs. If the samples are not matched and you were just using a regular t-test, I would at least use the Mann-Whitney (non-parametric) test instead.

You must also have an extra checkpoint: you should obtain the difference in mean β value between tumour and normal, i.e.:

difference in mean = mean β (tumour) - mean β (normal)

Then, use that as an extra cut-off in addition to the p-value.
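
A minimal sketch of that two-part filter in base R, assuming a probes-by-samples layout in which the tumour and normal columns are in matched order. The simulated data, the object names and the cut-offs (0.05 and 0.2) are illustrative assumptions, not values from this thread:

```r
set.seed(1)
n_pairs  <- 10
n_probes <- 50

# Simulated beta values: hypothetical data, for illustration only
normal <- matrix(runif(n_probes * n_pairs, 0.40, 0.60), nrow = n_probes)
tumour <- normal + matrix(rnorm(n_probes * n_pairs, 0, 0.02), nrow = n_probes)
# Shift the first 5 probes upwards in tumour (hypermethylated)
tumour[1:5, ] <- tumour[1:5, ] + 0.30
rownames(normal) <- rownames(tumour) <- paste0("cg", seq_len(n_probes))

# Wilcoxon signed rank test per probe on the matched T-N pairs
p_value <- sapply(seq_len(n_probes), function(i) {
  wilcox.test(tumour[i, ], normal[i, ], paired = TRUE)$p.value
})
# For unmatched samples, paired = FALSE gives the Mann-Whitney test instead

# difference in mean = mean beta (tumour) - mean beta (normal)
delta_beta <- rowMeans(tumour) - rowMeans(normal)

# Adjust for multiple testing, then apply both cut-offs together
p_adj <- p.adjust(p_value, method = "BH")
keep  <- p_adj < 0.05 & abs(delta_beta) > 0.2
selected <- rownames(tumour)[keep]
```

Requiring both conditions drops probes that are statistically significant but differ only trivially in methylation level. The Benjamini-Hochberg adjustment is an addition of this sketch; the thread itself only mentions a p-value cut-off.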


modified 4 weeks ago • written 4 weeks ago by Kevin Blighe37k

Thanks a lot, Kevin!

For scaling, I tried several approaches and got the same result with all of them:

  • converting to M values and then using the parameter scale='row'
  • using scale='row' on β values
  • using scale='none' but scaling before the heatmap

All look similar, so I think I did it correctly. I will check the distance and linkage metrics, as I used the default ones. And thank you for the input on the test; I now realize I should have used a non-parametric test.
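
For reference, the scaling routes mentioned above can be sketched in base R. The M-value conversion is the usual logit transform, M = log2(β / (1 - β)); the small matrix `beta` and the helper `row_z` are made up for illustration:

```r
# Hypothetical beta values (probes in rows, samples in columns)
beta <- matrix(c(0.10, 0.20, 0.80, 0.90,
                 0.50, 0.55, 0.45, 0.60),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("cg_a", "cg_b"), paste0("s", 1:4)))

# Beta -> M-value (logit on the log2 scale)
m_values <- log2(beta / (1 - beta))

# Row scaling done by hand: centre each probe to mean 0 and scale to SD 1.
# Base R's heatmap() centres and scales rows the same way when
# scale = 'row', so scaling beforehand and passing scale = 'none'
# should give the same picture.
row_z <- function(x) t(scale(t(x)))
beta_z <- row_z(beta)
m_z    <- row_z(m_values)
```

Since every route ends in a row-wise z-score, broadly similar heatmaps are to be expected. That agreement shows the scaling was applied consistently, but it says little about whether row scaling is the right view of the data.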

written 4 weeks ago by Pin.Bioinf230

When I previously did this, the Wilcoxon Signed Rank test p-value, combined with an extra cut-off for difference in mean β, was enough to adequately separate my groups of interest. The heatmap / clustering was then performed on unscaled β values:


modified 6 days ago • written 4 weeks ago by Kevin Blighe37k

I tried with:

hclustfun = function(x) hclust(x,method = 'centroid'),
distfun = function(x) dist(x,method = 'maximum'),

Here is the heatmap I got. It seems much better; should I stick with this one?

[image: Heatmap with different hclust and dist methods]

modified 4 weeks ago • written 4 weeks ago by Pin.Bioinf230

I was more interested in just learning about which metrics you were currently using. I would not use a metric that I did not understand, and obviously it is not good practice to just choose the metric that makes the data look better.

My usual default (for most data-types) is either of:

  • Euclidean distance with Ward's linkage (ward.D2)
  • 1 minus Pearson/Spearman correlation distance with Ward's linkage (ward.D2)
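
A minimal sketch of those two combinations in base R, on made-up data. The matrix `x`, its two sample groups and the probe baselines are assumptions for illustration only:

```r
set.seed(42)
# Hypothetical data: 20 probes x 8 samples, with a per-probe baseline
base <- seq(0.2, 0.6, length.out = 20)
x <- base + matrix(rnorm(20 * 8, mean = 0, sd = 0.02), nrow = 20)
# Samples 5-8 ("tumours") are shifted upwards in the first 10 probes
x[1:10, 5:8] <- x[1:10, 5:8] + 0.3
colnames(x) <- c(paste0("N", 1:4), paste0("T", 1:4))

# Option 1: Euclidean distance between samples + Ward's linkage
# dist() works on rows, so transpose to cluster samples (columns)
d_euc  <- dist(t(x), method = "euclidean")
hc_euc <- hclust(d_euc, method = "ward.D2")

# Option 2: 1 - Pearson correlation as the distance + Ward's linkage
# cor() correlates columns, so no transpose is needed here
d_cor  <- as.dist(1 - cor(x, method = "pearson"))  # or method = "spearman"
hc_cor <- hclust(d_cor, method = "ward.D2")

# Cut each tree into two clusters and compare with the known labels
groups_euc <- cutree(hc_euc, k = 2)
groups_cor <- cutree(hc_cor, k = 2)
table(groups_euc, substr(colnames(x), 1, 1))
```

The same choices can be passed to a heatmap function through `distfun` and `hclustfun`, in the form used earlier in this thread, for example `hclustfun = function(x) hclust(x, method = 'ward.D2')`.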
modified 4 weeks ago • written 4 weeks ago by Kevin Blighe37k

Yes, I agree. I was using the default parameters, which were "complete" and "euclidean". I tried Euclidean with ward.D2 and the differentiation between tumors and controls is still not clear: there is a blur in the middle between the two, and some tumors are the same color as the controls, as in my original heatmap.

written 4 weeks ago by Pin.Bioinf230

...but this may not necessarily be a problem, i.e., it may be the genuine result. Biology is much more complex than we can currently comprehend with our analytical methods. Every time that we take a sample and put it through our instruments, we are only looking at a 'snapshot' / moment in the evolution of the tissue/cell that is being studied, and much information is automatically eliminated because our very analytical methods are limited in what they can show.

So, if you cannot identify any issues with your coding, then it is the genuine result given the data that has been obtained.

I should add that you need to filter by both the p-value and the difference in mean β between tumour and normal.

written 4 weeks ago by Kevin Blighe37k