RNA seq, correlation, gene expression, STAR count
2
0
11 weeks ago
Rob ▴ 170

I want to run a correlation analysis between STAR count RNA-Seq gene expression data and a continuous variable. Which method is preferred, Spearman or Pearson?

Thanks

STAR correlation RNA-seq • 575 views
3
11 weeks ago

Both the Pearson correlation coefficient and the Spearman rank correlation coefficient are frequently used to correlate RNA-Seq gene expression data (measured as counts) with a continuous variable. Which one to use depends on the features of your data.

Spearman correlation: When to use it: apply Spearman correlation when your data are not normally distributed or may contain outliers. Spearman's rank correlation is a non-parametric measure that evaluates the monotonic relationship between variables. Compared to Pearson correlation, it is less sensitive to outliers and does not assume a linear relationship.

Pearson correlation: When to use it: apply Pearson correlation if your data are approximately normally distributed and free of major outliers. Pearson's correlation measures the strength of the linear relationship between variables, and its significance test assumes that the data are normally distributed.
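To make the difference concrete, here is a minimal sketch in plain Python. The counts and trait values are made-up toy numbers, and the helper names (`pearson`, `average_ranks`, `spearman`) are my own for illustration. It shows that Spearman is simply Pearson computed on ranks, so a single extreme count pulls the Pearson value down while leaving Spearman unchanged, and that a log transform of the counts changes Pearson but not Spearman.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumed, not from the thread): one gene across six samples,
# with one very large count, and a continuous trait.
counts = [5, 12, 30, 80, 200, 5000]
trait = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
log_counts = [math.log2(c + 1) for c in counts]

r_raw = pearson(counts, trait)        # pulled down by the extreme count
r_log = pearson(log_counts, trait)    # log transform makes the trend more linear
rho = spearman(counts, trait)         # depends only on the ordering
rho_log = spearman(log_counts, trait) # unchanged by the monotonic transform

print(f"Pearson raw: {r_raw:.3f}, Pearson log2: {r_log:.3f}, Spearman: {rho:.3f}")
```

In practice you would normally use library routines such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr`, which also return p-values; the hand-written versions above are only there to show what each coefficient measures.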

0

Thanks for your answer. My confusion is this: the STAR count gene expression data is harmonized, but not normalized. For the correlation analysis, I normalize the data. However, I do not know whether Pearson should be used here. Is the normalization I did the same as what is needed for Pearson, given that the data was not normalized originally?

3
11 weeks ago
dsull ★ 5.8k

Use both and report both.

0

Thanks for your response. But the two methods give me different numbers of significantly correlated genes. I want to use the genes as classifiers to develop a model, so I have to choose one method.

2

Just choose one and go with it -- each tells you something different about your data and there is no right answer. If this is such a big decision, then try both and see which one performs better on a held-out validation set.

In statistics, there are a lot of decisions where the answer is either "it depends" or "there is no right answer".

I have no idea what your classifier is or what your continuous variable is -- and even if I did, my answer would likely remain the same.
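The "try both and compare on held-out data" suggestion could be sketched roughly as below. Everything here is an assumption for illustration: the data are simulated, the gene names are invented, and the "model" is just a signed sum of the selected genes scored by its rank correlation with the trait, standing in for whatever classifier and metric you actually use. The one point that carries over is that gene selection uses the training samples only, and the comparison is made on samples the selection never saw.

```python
import math
import random

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom else 0.0

def average_ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for ties
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def select_genes(expr, trait, samples, corr, k):
    """Pick the k genes with the largest |correlation| on the given samples."""
    y = [trait[s] for s in samples]
    r = {g: corr([v[s] for s in samples], y) for g, v in expr.items()}
    top = sorted(r, key=lambda g: abs(r[g]), reverse=True)[:k]
    return {g: r[g] for g in top}

def validation_score(expr, trait, selected, samples):
    """Hypothetical stand-in model: signed sum of the selected genes,
    scored by its rank correlation with the trait on unseen samples."""
    signature = [
        sum(math.copysign(1.0, r) * expr[g][s] for g, r in selected.items())
        for s in samples
    ]
    return spearman(signature, [trait[s] for s in samples])

# Simulated log-scale expression: 5 trait-related genes, 45 noise genes.
random.seed(0)
n_samples = 60
trait = [random.uniform(0, 10) for _ in range(n_samples)]
expr = {f"signal_{i}": [t + random.gauss(0, 1) for t in trait] for i in range(5)}
expr.update({f"noise_{i}": [random.gauss(5, 1) for _ in range(n_samples)]
             for i in range(45)})

# Split the samples once; the held-out samples are never used for selection.
samples = list(range(n_samples))
random.shuffle(samples)
train, held_out = samples[:40], samples[40:]

results = {}
for name, corr in (("pearson", pearson), ("spearman", spearman)):
    selected = select_genes(expr, trait, train, corr, k=5)
    results[name] = (selected, validation_score(expr, trait, selected, held_out))
    print(name, sorted(selected), round(results[name][1], 3))
```

On clean simulated data like this, both methods will tend to pick the same genes; on real counts with outliers or non-linear trends they can diverge, which is exactly when the held-out comparison becomes informative.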

0

Thanks for the explanation.