I would greatly appreciate it if you could lend me some advice on the quantitative analysis of sample similarity in microarray data.
8 different biological samples, all with 3 technical replicates
Illumina WG-6 microarray, quantile normalised and log2 transformed
Multiple probes summarised into one per gene
Initially I did hierarchical clustering on the data matrix, using Euclidean distances and average linkage. I have also generated an MDS plot in which there is clear sample separation. Both the dendrogram and the MDS plot show the samples clustering as expected from their tissue of origin. (I'll upload images later)
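For readers who want to see what this clustering step does, here is a minimal sketch of average-linkage clustering on Euclidean distances in plain Python. The post does not say which software was used, so the function names and the toy profiles below are hypothetical; on real data a library routine would normally be used instead.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage (UPGMA).

    profiles: dict mapping a sample name to its list of expression values.
    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of sample names.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean of all member-to-member distances.
            height = sum(
                euclidean(profiles[i], profiles[j]) for i in a for j in b
            ) / (len(a) * len(b))
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        merges.append((a, b, height))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: 2 samples x 2 replicates, 3 genes (log2 scale).
toy = {
    "S1_rep1": [1.0, 2.0, 3.0],
    "S1_rep2": [1.1, 2.1, 2.9],
    "S2_rep1": [5.0, 6.0, 1.0],
    "S2_rep2": [5.2, 5.9, 1.1],
}
merges = average_linkage(toy)
for a, b, height in merges:
    print(a, "+", b, "at height", round(height, 3))
```

The merge heights are what a dendrogram draws on its vertical axis, so replicates of the same sample should join at low heights before different samples join.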
I want to use both the global gene expression and a specific preselected subset of genes to arrive at statements like: Sample_A is most similar to Sample_C, Sample_D is more similar to Sample_Z, etc. After talking to my good friend Google, two main methods seem to be used: correlation and distance.
So these are the analysis steps I performed
1) calculate Spearman's correlation/Euclidean distance between each pair of sample replicates
2) average the values over the replicates (to get a better overview of the data)
3) rank/order the averaged values
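The three steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the toy profiles, the (sample, replicate) keys, and the helper names are made up, and averaging is taken to mean averaging the pairwise values within each sample pair.

```python
import math
from collections import defaultdict
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def averaged_similarity(replicates, measure):
    """Step 1: apply `measure` to every pair of distinct replicates.
    Step 2: average the values within each unordered sample pair."""
    groups = defaultdict(list)
    for (key_a, prof_a), (key_b, prof_b) in combinations(replicates.items(), 2):
        pair = tuple(sorted((key_a[0], key_b[0])))
        groups[pair].append(measure(prof_a, prof_b))
    return {pair: sum(v) / len(v) for pair, v in groups.items()}

# Hypothetical toy data: keys are (sample, replicate), values are log2 profiles.
toy = {
    ("A", 1): [8.0, 6.0, 7.0, 10.0],
    ("A", 2): [8.1, 6.2, 7.1, 9.8],
    ("B", 1): [8.5, 6.5, 7.4, 9.0],
    ("B", 2): [8.4, 6.4, 7.6, 9.1],
    ("C", 1): [6.0, 10.0, 9.0, 7.0],
    ("C", 2): [6.2, 9.9, 9.1, 7.3],
}
distance_table = averaged_similarity(toy, euclidean)
correlation_table = averaged_similarity(toy, spearman)

# Step 3: most similar first (small distance, large correlation).
by_distance = sorted(distance_table.items(), key=lambda kv: kv[1])
by_correlation = sorted(correlation_table.items(), key=lambda kv: kv[1], reverse=True)
for pair, value in by_distance:
    print(pair, round(value, 3))
```

Pairs such as ("A", "A") hold the replicate-versus-replicate values, which give a baseline for how close two profiles of the same sample are.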
The lists produced by the two methods are given below.
Euclidean distances
Sample_1_vs_Sample_1    8.003162633
Sample_B_vs_Sample_B    8.111607651
Sample_6_vs_Sample_6    8.51449045
Sample_4_vs_Sample_4    8.684158695
Sample_5_vs_Sample_5    9.022024966
Sample_3_vs_Sample_3    9.349750723
Sample_2_vs_Sample_2    9.903293889
Sample_A_vs_Sample_A    9.966555661
Sample_1_vs_Sample_2    23.21577641
Sample_1_vs_Sample_3    34.83212106
Sample_4_vs_Sample_6    35.14049658
Sample_2_vs_Sample_3    36.54938163
Sample_5_vs_Sample_6    46.86066654
Sample_4_vs_Sample_5    47.83052274
Sample_1_vs_Sample_6    84.95058878
Sample_2_vs_Sample_6    85.03191301
Sample_1_vs_Sample_5    86.31027616
Sample_2_vs_Sample_5    86.74889618
Sample_3_vs_Sample_6    88.33253675
Sample_1_vs_Sample_4    88.8302554
Sample_2_vs_Sample_4    89.010459
Sample_3_vs_Sample_5    89.15878373
Sample_3_vs_Sample_4    92.50254662
Sample_B_vs_Sample_5    94.73304572
Sample_B_vs_Sample_6    96.26289691
Sample_B_vs_Sample_4    97.0321506
Sample_1_vs_Sample_B    98.91472002
Sample_2_vs_Sample_B    99.60447516
Sample_3_vs_Sample_B    100.3718217
Sample_A_vs_Sample_6    145.4080426
Sample_A_vs_Sample_1    147.2187797
Sample_A_vs_Sample_4    147.3384896
Sample_A_vs_Sample_5    147.519752
Sample_A_vs_Sample_3    147.7770987
Sample_A_vs_Sample_2    147.8183657
Sample_A_vs_Sample_B    156.9427804

Spearman's correlation
Sample_B_vs_Sample_B    0.974732385
Sample_4_vs_Sample_4    0.963556203
Sample_6_vs_Sample_6    0.958113935
Sample_1_vs_Sample_1    0.957711204
Sample_5_vs_Sample_5    0.957584535
Sample_2_vs_Sample_2    0.956886863
Sample_3_vs_Sample_3    0.953256139
Sample_A_vs_Sample_A    0.943146642
Sample_1_vs_Sample_2    0.928978596
Sample_4_vs_Sample_6    0.924040013
Sample_2_vs_Sample_3    0.916858011
Sample_1_vs_Sample_3    0.916702866
Sample_4_vs_Sample_5    0.912913506
Sample_5_vs_Sample_6    0.90990687
Sample_1_vs_Sample_6    0.855269466
Sample_1_vs_Sample_5    0.854439331
Sample_1_vs_Sample_4    0.853748338
Sample_2_vs_Sample_6    0.852426371
Sample_2_vs_Sample_5    0.851438395
Sample_2_vs_Sample_4    0.851191735
Sample_3_vs_Sample_5    0.840621555
Sample_3_vs_Sample_6    0.840232229
Sample_3_vs_Sample_4    0.8374603
Sample_1_vs_Sample_B    0.835720538
Sample_2_vs_Sample_B    0.835678178
Sample_B_vs_Sample_6    0.835516264
Sample_B_vs_Sample_4    0.834379338
Sample_B_vs_Sample_5    0.830327553
Sample_3_vs_Sample_B    0.82501908
Sample_A_vs_Sample_1    0.750257217
Sample_A_vs_Sample_2    0.748220867
Sample_A_vs_Sample_6    0.743813385
Sample_A_vs_Sample_3    0.743707556
Sample_A_vs_Sample_5    0.743260374
Sample_A_vs_Sample_4    0.741759797
Sample_A_vs_Sample_B    0.714091288
There are some discrepancies between the two lists, but generally they are pretty close.
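One way to put a number on "pretty close" is to rank-correlate the two measures over the same sample pairs. The sketch below does this for six between-sample values copied from the lists above; the helper names are hypothetical, and converting correlation to 1 - correlation is just one possible choice of dissimilarity.

```python
def rank(values):
    """Ranks 1..n in ascending order; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_no_ties(x, y):
    """Spearman's rho via the no-ties shortcut 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Six between-sample values copied from the two lists in the post.
pairs = ["1_vs_2", "1_vs_3", "4_vs_6", "2_vs_3", "5_vs_6", "4_vs_5"]
distance = [23.21577641, 34.83212106, 35.14049658,
            36.54938163, 46.86066654, 47.83052274]
correlation = [0.928978596, 0.916702866, 0.924040013,
               0.916858011, 0.90990687, 0.912913506]

# Turn correlation into a dissimilarity so both measures point the same way.
dissimilarity = [1 - r for r in correlation]
agreement = spearman_no_ties(distance, dissimilarity)
print("rank agreement:", round(agreement, 3))
```

A value near 1 means the two measures order the sample pairs almost identically; running the same calculation over all 28 between-sample pairs would summarise the agreement of the full lists.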
So my questions essentially are: are my methods statistically justifiable?
Is this the way in which this type of analysis is generally done?
How do I consolidate the data from the two similarity measures?
I have also come across the method of calculating the correlation of correlations, in which gene-wise correlation is calculated first and the resulting values are then used to calculate the correlation between samples, as described in these papers:
Zheng-Bradley et al. - Large scale comparison of global gene expression patterns in human and mouse
Cope et al. - MergeMaid (R implementation in the intCor function)
This corCor, or integrative correlation coefficient (IGC) as it is also called, seems to be used mainly when comparisons are made across different species, microarray data sets, or studies. However, I am wondering whether it would also be appropriate for my data, or whether it would be overkill.
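To make the "correlation of correlations" idea concrete, here is a rough Python sketch of a per-gene integrative correlation between two data sets. It is written from the general description of the method and is not the MergeMaid intCor implementation; the function names and toy values are hypothetical, and Pearson correlation is assumed throughout.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def integrative_correlation(study_a, study_b):
    """Per-gene correlation of correlations between two studies.

    study_a, study_b: dicts mapping gene -> expression values. The genes
    must overlap; the samples (and their number) may differ between studies.
    """
    genes = sorted(set(study_a) & set(study_b))
    scores = {}
    for g in genes:
        others = [h for h in genes if h != g]
        # Within each study: how gene g correlates with every other gene.
        profile_a = [pearson(study_a[g], study_a[h]) for h in others]
        profile_b = [pearson(study_b[g], study_b[h]) for h in others]
        # Across studies: do the two correlation profiles agree?
        scores[g] = pearson(profile_a, profile_b)
    return scores

# Hypothetical toy study: 4 genes measured on 4 samples.
study_a = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 5.0, 9.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 5.0],
}
# Second toy study: the same values rescaled, so the gene-gene
# correlations are unchanged and every score should be close to 1.
study_b = {g: [2 * v + 1 for v in values] for g, values in study_a.items()}

scores = integrative_correlation(study_a, study_b)
for gene, score in scores.items():
    print(gene, round(score, 3))
```

Note that this scores genes across two data sets rather than samples within one, which is part of why it is mostly seen in cross-study or cross-species comparisons.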
Any comments, guidelines, or advice would be greatly appreciated!