Hi!

I've been stuck on a problem for a while and I really hope that someone here can help me, so here goes! I have RNA-seq data from 3 different types of samples: one is a disease tissue (n=44; there might be sub-groups within the disease samples) and two are potential controls (n=16 and n=4). I have been looking for a way to say which of the two potential controls is more similar to the disease tissue.

My approach thus far has been the following: First of all, I need some kind of value that represents the similarity between each of the samples. I've chosen Pearson correlation to compare each sample with each other, since as far as I'm aware it is the most used type of correlation for this kind of data. Do you agree that I should use Pearson correlation?

Then, to be able to say which control is more similar to the disease tissue, I've split the correlation coefficients into two groups: one holding the coefficients between every disease sample and every sample in the first control group, the other holding the coefficients between every disease sample and every sample in the second control group. I end up with 704 (16 times 44) correlation coefficients for one, and 176 (4 times 44) for the other. Then I've compared these two groups of 704 and 176 coefficients using the Mann-Whitney U test (both two-sided and one-sided) to be able to say which group of coefficients is 'higher' and thus which control tissue is more similar to the disease tissue.
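To make the procedure above concrete, here is a minimal Python sketch of it on toy data. It is only an illustration under assumptions: the expression values are made up, the function names (`pearson`, `cross_group_correlations`, `mann_whitney_u`) are hypothetical, and the p-value uses the plain normal approximation without tie or continuity correction, so a statistics package would give slightly different numbers.

```python
import math
from itertools import product


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cross_group_correlations(disease, control):
    """One coefficient per (disease sample, control sample) pair."""
    return [pearson(d, c) for d, c in product(disease, control)]


def mann_whitney_u(a, b):
    """U statistic for `a` plus a two-sided p-value (normal approximation)."""
    # U counts the pairs in which the value from `a` exceeds the value from `b`;
    # ties count as half a pair.
    u = sum((x > y) + 0.5 * (x == y) for x, y in product(a, b))
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))


# Toy profiles (hypothetical values): each inner list is one sample, each position one gene.
disease = [[5.0, 1.0, 3.0, 8.0], [6.0, 1.5, 2.5, 9.0], [4.5, 0.5, 3.5, 7.0]]
control_a = [[5.5, 1.2, 3.1, 8.5], [5.2, 0.9, 2.8, 7.9]]
control_b = [[1.0, 6.0, 7.0, 2.0], [2.0, 5.0, 8.0, 1.0]]

r_a = cross_group_correlations(disease, control_a)  # 3 x 2 = 6 coefficients
r_b = cross_group_correlations(disease, control_b)  # 3 x 2 = 6 coefficients
u, p = mann_whitney_u(r_a, r_b)
print(len(r_a), len(r_b), u, p)
```

With the real data the two lists would hold 704 and 176 coefficients instead of 6 each.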

I would like to get some input on whether you think this is an OK approach for my question.

Also I am wondering if there is any way that I can more directly compare the groups, without calculating Pearson correlation coefficients between each of the samples first, to be able to say something about their similarity?

Thank you so much in advance for your input!

Thank you for your reply. Can you recommend a package for PCA, and do you have any tips? I've never done PCA before. And how do you calculate sample distances with DESeq2? I've been using edgeR for my analysis; is there a way to do this in edgeR as well? Thanks again!
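As a rough illustration of what a sample distance is, the sketch below computes Euclidean distances between samples on log-transformed expression values and averages them between groups. This is plain NumPy, not DESeq2 or edgeR code; the function names, the toy matrix, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np


def sample_distances(log_expr):
    """Euclidean distance between every pair of samples.

    log_expr: array of shape (n_samples, n_genes), already log-transformed.
    Returns a symmetric (n_samples, n_samples) matrix.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    sq_norms = (log_expr ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * log_expr @ log_expr.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(sq_dist, 0.0, None))


def mean_cross_distance(dist, idx_a, idx_b):
    """Average distance between the samples in group A and those in group B."""
    return dist[np.ix_(idx_a, idx_b)].mean()


# Toy log-expression matrix (hypothetical values): rows are samples, columns are genes.
log_expr = np.array([
    [2.0, 5.0, 1.0],  # disease 1
    [2.5, 4.5, 1.5],  # disease 2
    [2.2, 4.8, 1.1],  # control A
    [6.0, 1.0, 4.0],  # control B
])
dist = sample_distances(log_expr)
disease, control_a, control_b = [0, 1], [2], [3]
d_a = mean_cross_distance(dist, disease, control_a)
d_b = mean_cross_distance(dist, disease, control_b)
print(d_a, d_b)  # the smaller value marks the control group closer to the disease samples
```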

I have raw gene counts, normalized counts calculated with DESeq and RPKM values.

I tried to do PCA today, but I'm not sure I'm doing it right. I used my DESeq normalized counts, added 1 to the counts that were 0 (so as not to create negative infinities; do you have a better way to deal with this?) and then log2-transformed them.
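One common alternative to adding 1 only to the zero counts is to add the same pseudocount to every value before taking the log. The sketch below compares the two on made-up normalized counts; the function names are hypothetical and the pseudocount of 1 is just a conventional choice, not something prescribed by DESeq or edgeR.

```python
import math


def log2_zeros_only(values):
    """log2 after replacing zeros with 1 (the approach described above)."""
    return [math.log2(v if v > 0 else 1.0) for v in values]


def log2_all_shifted(values, pseudocount=1.0):
    """log2 after adding the same pseudocount to every value."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical normalized counts; normalization can produce values between 0 and 1.
norm_counts = [0.0, 0.4, 1.0, 3.0, 1023.0]

zeros_only = log2_zeros_only(norm_counts)
all_shifted = log2_all_shifted(norm_counts)

for raw, a, b in zip(norm_counts, zeros_only, all_shifted):
    print(raw, round(a, 3), round(b, 3))
```

Shifting only the zeros maps counts of 0 and 1 to the same value and sends counts between 0 and 1 negative, whereas shifting everything keeps the original ordering and stays non-negative.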

I've used the code from this post: "PCA plot from read count matrix from RNA-Seq". First of all, my PC1 and PC2 do not explain much of the variation (21.82% and 10.62%, respectively). Is this a problem? Second, I don't know how to label my data by group (the samples belong to three groups) the way they do in the post I took the code from. All the dots in my plot are just black.
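For reference, this is roughly what a PCA of the samples involves, together with one way to attach a color to each sample by group. It is a minimal NumPy sketch on simulated data, not the code from the linked post; the group sizes, the `pca` helper, and the color names are assumptions made for the example.

```python
import numpy as np


def pca(log_expr, n_components=2):
    """PCA of samples via SVD of the gene-centered matrix.

    log_expr: (n_samples, n_genes) log-transformed expression.
    Returns (scores, explained): sample coordinates on the leading components
    and the fraction of total variance each of those components explains.
    """
    centered = log_expr - log_expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]


# Simulated data: 12 samples x 50 genes, with the first 6 samples shifted in 10 genes.
rng = np.random.default_rng(0)
groups = ["disease"] * 6 + ["control_A"] * 4 + ["control_B"] * 2
log_expr = rng.normal(size=(len(groups), 50))
log_expr[:6, :10] += 3.0

scores, explained = pca(log_expr)

# One color per sample, looked up from its group label.
palette = {"disease": "red", "control_A": "blue", "control_B": "green"}
colors = [palette[g] for g in groups]

print(explained)   # fraction of variance explained by PC1 and PC2
print(colors[:3])
```

If matplotlib is available, `matplotlib.pyplot.scatter(scores[:, 0], scores[:, 1], c=colors)` should draw the points in their group colors instead of all black.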

Lastly, do you know how an MDS plot (which I've generated when I used edgeR) relates to a plot from PCA?

Thanks so much in advance!

This workflow from the DESeq2 authors will provide a very good foundation for your analysis. MDS is similar to PCA, though not exactly the same. As for how much variation they explain, it really depends. I don't know how different you expect the samples to be. Those values may still be perfectly reasonable.
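To illustrate how close the two methods can be: classical (metric) MDS applied to Euclidean distances gives the same sample coordinates as PCA, up to the sign of each axis. The sketch below checks this on random data with NumPy. It does not reproduce edgeR's MDS plot, which as far as I know is based on a different distance by default, so treat the equivalence as a property of this specific setup.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


def pca_scores(x, n_components=2):
    """Sample coordinates on the leading principal components."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Random toy data: 8 samples x 30 genes.
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 30))

# Euclidean distances between all pairs of samples.
diff = x[:, None, :] - x[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

mds = classical_mds(dist)
pcs = pca_scores(x)

# Compare absolute values because each axis is only defined up to a sign flip.
same_up_to_sign = np.allclose(np.abs(mds), np.abs(pcs), atol=1e-6)
print(same_up_to_sign)
```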

Thanks Jared, I will have a look at the workflow!

I don't expect my samples to be that different. Would that be consistent with PC1 and PC2 explaining a low percentage of the variation, or would you expect the opposite?

Thanks!

That just indicates that no single component explains all the variation between the groups. I wouldn't be too concerned about those values. PCA is best used as a tool to examine inter- and intra-group variation so you can account for any batch effects and model your comparisons appropriately, imo.

Okay thanks for the explanation!