Question: Microarray Sample Similarity
6.9 years ago, Anomilie wrote:


I would greatly appreciate some advice on the quantitative analysis of sample similarity in microarray data.

The Data:

8 different biological samples, all with 3 technical replicates

Illumina WG-6 microarray, quantile normalised and log2 transformed

Multiple probes summarised into one per gene

Initially I did hierarchical clustering on the data matrix, using Euclidean distances and average linkage. I also generated an MDS plot, which shows clear sample separation. Both the dendrogram and the MDS plot show the samples clustering as expected from their tissue of origin. (I'll upload images later)

Both the global gene expression and a specific preselected subset of genes will be used, and the result should be statements like: Sample_A is most similar to Sample_C, Sample_D is more similar to Sample_Z, etc. After talking to my good friend Google, two main kinds of measure seem to be used: correlation and distance.

So these are the analysis steps I performed

1) calculate Spearman's correlation/Euclidean distances between each pair of sample replicates

2) average the replicates (to get a better overview of the data)

3) rank/order remaining data
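
As an illustration, the three steps above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the `expression` values are made-up toy numbers (not the real array data), the `_rep` naming scheme and the helper names are hypothetical, and step 2 is read as "average the replicate-level values for each pair of samples".

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's correlation is Pearson's correlation of the ranks.
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical toy data: replicate name -> log2 expression per gene.
expression = {
    "Sample_1_rep1": [7.1, 9.4, 5.2, 11.0, 6.3],
    "Sample_1_rep2": [7.0, 9.6, 5.1, 10.8, 6.5],
    "Sample_2_rep1": [7.4, 9.0, 5.9, 10.1, 6.0],
    "Sample_2_rep2": [7.5, 8.8, 6.0, 10.3, 6.1],
    "Sample_A_rep1": [10.2, 6.1, 8.8, 6.0, 9.5],
    "Sample_A_rep2": [10.0, 6.3, 8.9, 6.2, 9.3],
}

def sample_of(replicate_name):
    return replicate_name.rsplit("_rep", 1)[0]

def averaged_pairwise(metric):
    """Step 1: metric for every replicate pair; step 2: average per sample pair."""
    grouped = defaultdict(list)
    for a, b in combinations(sorted(expression), 2):
        key = tuple(sorted((sample_of(a), sample_of(b))))
        grouped[key].append(metric(expression[a], expression[b]))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}

distances = averaged_pairwise(euclidean)
correlations = averaged_pairwise(spearman)

# Step 3: rank the sample pairs (smallest distance / largest correlation first).
by_distance = sorted(distances, key=distances.get)
by_correlation = sorted(correlations, key=correlations.get, reverse=True)
for pair in by_distance:
    print(f"{pair[0]}_vs_{pair[1]}\t{distances[pair]:.3f}\t{correlations[pair]:.3f}")
```

Pairs such as `Sample_1_vs_Sample_1` here are the replicate-to-replicate comparisons within one sample, which is why they are not exactly 0 (distance) or 1 (correlation).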

The two lists, one per method, are given below.

    Euclidean Distances    
Sample_1_vs_Sample_1    8.003162633
Sample_B_vs_Sample_B    8.111607651
Sample_6_vs_Sample_6    8.51449045
Sample_4_vs_Sample_4    8.684158695
Sample_5_vs_Sample_5    9.022024966
Sample_3_vs_Sample_3    9.349750723
Sample_2_vs_Sample_2    9.903293889
Sample_A_vs_Sample_A    9.966555661
Sample_1_vs_Sample_2    23.21577641
Sample_1_vs_Sample_3    34.83212106
Sample_4_vs_Sample_6    35.14049658
Sample_2_vs_Sample_3    36.54938163
Sample_5_vs_Sample_6    46.86066654
Sample_4_vs_Sample_5    47.83052274
Sample_1_vs_Sample_6    84.95058878
Sample_2_vs_Sample_6    85.03191301
Sample_1_vs_Sample_5    86.31027616
Sample_2_vs_Sample_5    86.74889618
Sample_3_vs_Sample_6    88.33253675
Sample_1_vs_Sample_4    88.8302554
Sample_2_vs_Sample_4    89.010459
Sample_3_vs_Sample_5    89.15878373
Sample_3_vs_Sample_4    92.50254662
Sample_B_vs_Sample_5    94.73304572
Sample_B_vs_Sample_6    96.26289691
Sample_B_vs_Sample_4    97.0321506
Sample_1_vs_Sample_B    98.91472002
Sample_2_vs_Sample_B    99.60447516
Sample_3_vs_Sample_B    100.3718217
Sample_A_vs_Sample_6    145.4080426
Sample_A_vs_Sample_1    147.2187797
Sample_A_vs_Sample_4    147.3384896
Sample_A_vs_Sample_5    147.519752
Sample_A_vs_Sample_3    147.7770987
Sample_A_vs_Sample_2    147.8183657
Sample_A_vs_Sample_B    156.9427804

 Spearman's correlation    
Sample_B_vs_Sample_B    0.974732385
Sample_4_vs_Sample_4    0.963556203
Sample_6_vs_Sample_6    0.958113935
Sample_1_vs_Sample_1    0.957711204
Sample_5_vs_Sample_5    0.957584535
Sample_2_vs_Sample_2    0.956886863
Sample_3_vs_Sample_3    0.953256139
Sample_A_vs_Sample_A    0.943146642
Sample_1_vs_Sample_2    0.928978596
Sample_4_vs_Sample_6    0.924040013
Sample_2_vs_Sample_3    0.916858011
Sample_1_vs_Sample_3    0.916702866
Sample_4_vs_Sample_5    0.912913506
Sample_5_vs_Sample_6    0.90990687
Sample_1_vs_Sample_6    0.855269466
Sample_1_vs_Sample_5    0.854439331
Sample_1_vs_Sample_4    0.853748338
Sample_2_vs_Sample_6    0.852426371
Sample_2_vs_Sample_5    0.851438395
Sample_2_vs_Sample_4    0.851191735
Sample_3_vs_Sample_5    0.840621555
Sample_3_vs_Sample_6    0.840232229
Sample_3_vs_Sample_4    0.8374603
Sample_1_vs_Sample_B    0.835720538
Sample_2_vs_Sample_B    0.835678178
Sample_B_vs_Sample_6    0.835516264
Sample_B_vs_Sample_4    0.834379338
Sample_B_vs_Sample_5    0.830327553
Sample_3_vs_Sample_B    0.82501908
Sample_A_vs_Sample_1    0.750257217
Sample_A_vs_Sample_2    0.748220867
Sample_A_vs_Sample_6    0.743813385
Sample_A_vs_Sample_3    0.743707556
Sample_A_vs_Sample_5    0.743260374
Sample_A_vs_Sample_4    0.741759797
Sample_A_vs_Sample_B    0.714091288

There are some discrepancies between the two lists, but in general they agree closely.
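
One possible way to put a number on how closely the two lists agree is a rank correlation between the two orderings. The sketch below does this for a subset of eight pairs whose values are copied from the lists above; the choice of subset and the helper names are illustrative assumptions, not part of the original analysis.

```python
from math import sqrt

# (pair, Euclidean distance, Spearman correlation), copied from the lists above.
pairs = [
    ("Sample_1_vs_Sample_2", 23.21577641, 0.928978596),
    ("Sample_1_vs_Sample_3", 34.83212106, 0.916702866),
    ("Sample_4_vs_Sample_6", 35.14049658, 0.924040013),
    ("Sample_2_vs_Sample_3", 36.54938163, 0.916858011),
    ("Sample_5_vs_Sample_6", 46.86066654, 0.90990687),
    ("Sample_4_vs_Sample_5", 47.83052274, 0.912913506),
    ("Sample_1_vs_Sample_6", 84.95058878, 0.855269466),
    ("Sample_A_vs_Sample_B", 156.9427804, 0.714091288),
]

def rank(values):
    """1-based ranks; assumes no ties, which holds for the values above."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

distance_rank = rank([d for _, d, _ in pairs])      # 1 = smallest distance
similarity_rank = rank([-c for _, _, c in pairs])   # 1 = highest correlation

# Spearman's rho between the two orderings; 1.0 would mean identical rankings.
agreement = pearson(distance_rank, similarity_rank)
print(round(agreement, 3))  # prints 0.905

# Pairs whose position differs between the two orderings.
for (name, _, _), dr, sr in zip(pairs, distance_rank, similarity_rank):
    if dr != sr:
        print(name, "distance rank", dr, "correlation rank", sr)
```

For this subset the disagreements are small swaps between neighbouring pairs (for example Sample_1_vs_Sample_3 and Sample_4_vs_Sample_6), not large reorderings.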

So my questions essentially are: are my methods statistically justifiable?

Is this the way in which this type of analysis is generally done?

How do I consolidate the data from the two similarity measures?

I have also come across the method of calculating a correlation of correlations, in which gene-wise correlations are calculated first and the resulting values are then used to calculate the correlation between samples, as described in these papers:

Russ & Futschik - Comparison and consolidation of microarray data sets of human tissue expression

Zheng-Bradley, et al - Large scale comparison of global gene expression patterns in human and mouse

Cope et al - MergeMaid (R implementation in intCor function)

This corCor, or integrative correlation coefficient (IGC) as it is also called, seems to be used mainly when comparisons are made across different species, different microarray data sets, or different studies. However, I am wondering whether it would also be appropriate for my data, or whether it would be overkill.
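
For readers unfamiliar with the idea, here is a rough sketch of a correlation of correlations between two data sets. It reflects one common reading of integrative correlation (for each gene, correlate its profile of gene-gene correlations in one study with the same profile in another study); it is not the code of MergeMaid's intCor, and the two toy studies, gene names, and function names are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def gene_correlation_profiles(study):
    """For each gene, its Pearson correlation with every other gene in the study."""
    genes = sorted(study)
    return {
        g: [pearson(study[g], study[h]) for h in genes if h != g]
        for g in genes
    }

def correlation_of_correlations(study_a, study_b):
    """Per gene: correlate its gene-gene correlation profile across the two studies.

    Assumes both studies measure exactly the same set of genes.
    """
    profiles_a = gene_correlation_profiles(study_a)
    profiles_b = gene_correlation_profiles(study_b)
    return {g: pearson(profiles_a[g], profiles_b[g]) for g in profiles_a}

# Hypothetical toy studies: gene -> expression across that study's own samples.
# The studies may have different numbers of samples.
study_a = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [1.1, 2.3, 2.9, 4.2],
    "G3": [4.0, 3.1, 2.2, 0.9],
    "G4": [2.0, 1.0, 2.5, 1.5],
}
study_b = {
    "G1": [5.0, 6.1, 7.2, 8.0, 9.1],
    "G2": [5.2, 6.0, 7.5, 8.1, 8.9],
    "G3": [9.0, 8.2, 6.9, 6.1, 5.0],
    "G4": [6.0, 7.5, 6.2, 7.9, 6.4],
}

scores = correlation_of_correlations(study_a, study_b)
for gene, score in sorted(scores.items()):
    print(gene, round(score, 3))
```

A high score for a gene means its relationships to the other genes are reproduced in both studies. Note that this compares two separate data sets, so applying it within a single experiment would first require deciding how to split the samples.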

Any comments, guidelines, or advice would be greatly appreciated!

6.9 years ago, Charles Warden (Duarte, CA) wrote:

I think you have the right idea.

I think the dendrogram is the most important. I would typically produce this with distance = Pearson dissimilarity (so, 1 - correlation coefficient), but Euclidean distance is also OK.
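
To make the Pearson dissimilarity concrete, here is a minimal sketch of 1 - correlation combined with average linkage. The three sample profiles are made-up toy values and the function names are hypothetical; in practice a library routine would normally be used instead of this hand-rolled loop.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_dissimilarity(x, y):
    # 0 for perfectly correlated profiles, up to 2 for perfectly anti-correlated.
    return 1.0 - pearson(x, y)

def average_linkage(profiles):
    """Agglomerative clustering with average linkage; returns the merge history."""
    names = sorted(profiles)
    dist = {
        frozenset((a, b)): pearson_dissimilarity(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(cluster_pair):
        # Mean dissimilarity over all cross-cluster sample pairs.
        a, b = cluster_pair
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: sample -> log2 expression per gene.
profiles = {
    "Sample_1": [7.1, 9.4, 5.2, 11.0, 6.3],
    "Sample_2": [7.4, 9.0, 5.9, 10.1, 6.0],
    "Sample_A": [10.2, 6.1, 8.8, 6.0, 9.5],
}

merges = average_linkage(profiles)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

The merge heights are what a dendrogram draws on its vertical axis. If SciPy is available, `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="correlation"` computes the same kind of tree.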

If there are some especially important distances, you can report the distance metric. Strictly speaking, this isn't a statistical measure, but I've used this solution before (see Figure 2A, 2B and Table S2 in the following paper):

In evolutionary biology, there are lots of programs that provide confidence intervals during tree building (where non-significant branches are not shown, meaning that a node can have more than 3 branches). However, I don't recall seeing this for gene expression data. In other words, there is some way to define differences as significant, but it may not be trivial and it still doesn't address the request for a distance metric.

19 months ago, Morteza Razavi (Kharazmi University of Tehran, Iran) wrote:

We can compare the similarity of expression samples by correlating their ssGSEA results. ssGSEA calculates up- and down-regulated gene sets in each sample.
