Question: Will different proportions of control and patient samples affect genes' Pearson correlations?
I have RNA-seq data from different ages (10, 20, 30, 40, and 50 years old) in 50 controls and 14 patients. Based on differential analysis, I found some genes that differ across age. I want to divide these genes into several clusters by hierarchical clustering on their Pearson correlation r, so that the genes in each cluster share a similar pattern across age. For instance, the genes in one cluster might be highest at young ages in controls but highest at old ages in patients.

However, there are only a few samples at the young ages, and the patient sample size is much smaller than the control sample size. I find that if I first calculate the mean at each age, separately in controls and in patients, and then cluster on the genes' correlations, the Pearson r is different from the one I get when clustering on correlations computed from all samples. Will the different numbers of controls and patients, and the different numbers of samples per age, affect the correctness of the Pearson correlation?

Hello Lucy, I do not completely understand your final paragraph. However, differences in sample numbers will definitely affect the correlation statistic.
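To make the point about sample numbers concrete, here is a minimal sketch on synthetic data. The group sizes, effect size, and gene names are made up for illustration and are not taken from the dataset in the question. It compares Pearson r computed over all individual samples with Pearson r computed over the per-age means, which is the comparison described in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, deliberately unbalanced design: few samples at young ages.
ages = np.array([10, 20, 30, 40, 50])
n_per_age = np.array([2, 3, 10, 15, 20])  # made-up counts, not the real data
age_of_sample = np.repeat(ages, n_per_age)

# Two synthetic genes that both rise with age, each with independent noise.
gene_a = 0.05 * age_of_sample + rng.normal(0, 1.0, age_of_sample.size)
gene_b = 0.05 * age_of_sample + rng.normal(0, 1.0, age_of_sample.size)

# Pearson r across all individual samples: large groups dominate the estimate.
r_all = np.corrcoef(gene_a, gene_b)[0, 1]

# Pearson r across the five per-age means: every age counts equally,
# even the ages whose mean rests on only two or three samples.
mean_a = np.array([gene_a[age_of_sample == a].mean() for a in ages])
mean_b = np.array([gene_b[age_of_sample == a].mean() for a in ages])
r_means = np.corrcoef(mean_a, mean_b)[0, 1]

print(f"r over all samples:     {r_all:.2f}")
print(f"r over age-group means: {r_means:.2f}")
```

The two values generally differ. Averaging removes within-group noise but leaves only five points per gene, and it gives a mean of two samples the same weight as a mean of twenty, so neither number is "the" correct r; they answer slightly different questions.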

If you are aiming to look for 'patterns' in the age groups based on correlation, then tools already exist. These involve the construction of a square correlation matrix, which is then used as the foundation for network analysis. In a square correlation matrix, each sample is correlated to every other sample.
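As a rough illustration of what a square correlation matrix looks like, the sketch below builds one with NumPy from a random placeholder matrix (genes in rows, samples in columns; the data and dimensions are assumptions, not real expression values). The same call gives a gene-by-gene matrix, which is what clustering genes would need, or a sample-by-sample matrix after transposing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder expression matrix: 6 genes (rows) x 64 samples (columns).
n_genes, n_samples = 6, 64
expr = rng.normal(size=(n_genes, n_samples))

# np.corrcoef treats each row as one variable, so this is gene-by-gene.
gene_corr = np.corrcoef(expr)

# Transposing puts samples in rows, giving a sample-by-sample matrix.
sample_corr = np.corrcoef(expr.T)

# 1 - r is one common choice of dissimilarity for hierarchical clustering:
# perfectly correlated genes get distance 0, anti-correlated genes get 2.
gene_dissimilarity = 1.0 - gene_corr

print(gene_corr.shape, sample_corr.shape)
```

Both matrices are symmetric with ones on the diagonal, since every variable correlates perfectly with itself.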

Thank you Kevin! However, I am not sure whether I can use WGCNA, because it might first calculate gene modules by correlation based on the control samples, so I think it may not reflect what really happens in the disease samples. I think the disease-sample modules should be different from the control modules.

Okay, why not generate one network for controls and another for disease? Network analysis, generally, has major flaws. I believe that it still has to prove its value as a robust method that can help us to disentangle disease mechanisms.
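A minimal sketch of the "one matrix per group" idea, using random placeholder data with the group sizes from the question (50 controls, 14 patients). It only computes the two gene-by-gene correlation matrices and their difference; it is not a WGCNA workflow. The standard-error lines use the usual Fisher z approximation, in which atanh(r) has standard error of about 1/sqrt(n - 3).

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder data: 5 genes (rows), samples in columns.
n_genes = 5
n_control, n_patient = 50, 14
control = rng.normal(size=(n_genes, n_control))
patient = rng.normal(size=(n_genes, n_patient))

# One gene-by-gene correlation matrix per group.
corr_control = np.corrcoef(control)
corr_patient = np.corrcoef(patient)

# Element-wise difference highlights gene pairs whose correlation changes.
corr_diff = corr_patient - corr_control

# Approximate standard error of Fisher-transformed r for each group.
se_control = 1.0 / np.sqrt(n_control - 3)
se_patient = 1.0 / np.sqrt(n_patient - 3)

print(f"approx. SE of atanh(r), controls: {se_control:.3f}")
print(f"approx. SE of atanh(r), patients: {se_patient:.3f}")
```

Because the patient group is much smaller, its correlations are estimated with roughly twice the uncertainty, so some of the difference between the two matrices is expected from sampling noise alone.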

