I've been stuck on a problem for a while and I really hope that someone here can help me, so here goes! I have RNAseq data from 3 different types of samples: one is a disease tissue (n=44; there might be sub-groups within the disease samples) and two are potential controls (n=16 and n=4). I have been looking for a way to say which of the two potential controls is more similar to the disease tissue.
My approach thus far has been the following: First of all, I need some kind of value that represents the similarity between each pair of samples. I've chosen Pearson correlation to compare each sample with every other sample, since as far as I'm aware it is the most commonly used type of correlation for this kind of data. Do you agree that I should use Pearson correlation?
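To make this step concrete, here is a minimal sketch of how the sample-by-sample Pearson correlations could be computed. It assumes the expression values sit in a genes x samples array; the function name `sample_correlation` and the toy data are made up for illustration and are not my real data or pipeline.

```python
import numpy as np

def sample_correlation(expr):
    """Pearson correlation between every pair of samples.

    expr: 2-D array, rows = genes, columns = samples (assumed layout).
    Returns a samples x samples correlation matrix.
    """
    expr = np.asarray(expr, dtype=float)
    # rowvar=False tells NumPy that each column (sample) is a variable
    return np.corrcoef(expr, rowvar=False)

# Toy example with made-up counts: 200 genes, 6 samples
rng = np.random.default_rng(0)
toy = rng.poisson(lam=50, size=(200, 6))
corr = sample_correlation(toy)
print(corr.shape)
```

Entry `corr[i, j]` is the coefficient between sample `i` and sample `j`, so the matrix is symmetric with ones on the diagonal.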
Then, to be able to say which control is more similar to the disease tissue, I've split the correlation coefficients into two groups. The first group contains the coefficient between each disease sample and each sample in the first control group, and the second group contains the coefficient between each disease sample and each sample in the second control group. I end up with 704 (16 times 44) correlation coefficients for one group and 176 (4 times 44) for the other. I've then compared these two groups of coefficients using the Mann-Whitney U test (both two-sided and one-sided) to be able to say which group of correlation coefficients is 'higher', and thus which control tissue is more similar to the disease tissue.
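In code, the comparison I describe looks roughly like the sketch below. It assumes a genes x samples array plus index arrays saying which columns belong to which group; `compare_controls`, the index layout, and the random data are all hypothetical stand-ins, and the one-sided direction ("greater") is just one of the two possible choices.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_controls(expr, disease_idx, ctrl_a_idx, ctrl_b_idx):
    """Compare disease-vs-control correlations for two control groups."""
    corr = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    # Keep only the disease x control blocks of the correlation matrix
    a = corr[np.ix_(disease_idx, ctrl_a_idx)].ravel()
    b = corr[np.ix_(disease_idx, ctrl_b_idx)].ravel()
    p_two_sided = mannwhitneyu(a, b, alternative="two-sided").pvalue
    # One-sided: are coefficients for control A larger than for control B?
    p_one_sided = mannwhitneyu(a, b, alternative="greater").pvalue
    return a, b, p_two_sided, p_one_sided

# Made-up data with the same group sizes as in my question
rng = np.random.default_rng(1)
expr = rng.normal(size=(500, 64))
disease = np.arange(0, 44)   # 44 disease samples
ctrl_a = np.arange(44, 60)   # 16 samples, first control group
ctrl_b = np.arange(60, 64)   # 4 samples, second control group

a, b, p2, p1 = compare_controls(expr, disease, ctrl_a, ctrl_b)
print(a.size, b.size)
```

One thing I am unsure about with this setup: every sample contributes to many of the coefficients, so the 704 and 176 values are not independent observations, which the Mann-Whitney U test assumes.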
I would like to get some input on whether you think this is an OK approach for my question.
Also I am wondering if there is any way that I can more directly compare the groups, without calculating Pearson correlation coefficients between each of the samples first, to be able to say something about their similarity?
Thank you so much in advance for your input!