How to calculate correlations between sparse data
11 days ago
leranwangcs ▴ 140

Hi,

I have a continuous variable A which is not sparse, and a group of continuous variables which are very sparse (some of them have only one non-zero value). I want to calculate the correlation between variable A and each variable in the group. I used cor.test() from the R package stats, whose default method is Pearson. However, the results do not look trustworthy: a variable that has only one non-zero value shows the most significant correlation with variable A, based on the p-value.

Am I using the wrong test for this type of data? What is a better way to calculate these correlations?

Thanks!


I'm not sure whether this has a foundation in statistics.

I suggest you try doing a singular value decomposition on both datasets, then take the first 10 components and calculate the correlations between those vectors.


Hmm, perhaps try a distance metric like mean squared deviation?


Thanks for the suggestion! Could you please give me some more details on how to do this?

Thanks so much!


You will need truncated SVD for sparse data. Take your data matrix, choose the number of components (I suggest 5-10), and that is pretty much it.

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

11 days ago
Jeremy ▴ 910

I would suggest setting method = 'spearman', which can detect monotonic relationships even when they are non-linear. With MSE, I think variables with more non-zero values will have a shorter distance, but I'm not sure that would really measure correlation.


Spearman doesn't work well with sparsity -- it is based on ranking, and if you have a bunch of zeroes, most observations are tied and the ranks carry little information. I think Kendall's tau works better as a nonparametric approach.
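As a rough illustration of how Kendall's tau treats tied zeros, here is a minimal pure-Python sketch of the tau-b variant, which drops tied pairs from the numerator and adjusts the denominator for them. The function name `kendall_tau_b` and the data are made up for illustration; in R the corresponding option would be method = "kendall" in cor.test(). Note that a variable with a single non-zero value is still driven by one sample, so this is not a cure for extreme sparsity.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C / D are concordant / discordant pairs; Tx (Ty) counts pairs tied
    only in x (only in y). Pairs tied in both are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical data: A is dense, sparse_var is mostly zeros (tied values).
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sparse_var = [0.0, 0.0, 0.0, 0.0, 2.0, 5.0]

tau = kendall_tau_b(A, sparse_var)
print(round(tau, 4))
```

The six pairs among the four zeros are tied in `sparse_var`, so they add nothing to the numerator and only enter the denominator through the tie correction.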

The issue doesn't appear to be non-linearity; it appears to be sparsity.
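To make the sparsity point concrete, here is a small Python sketch (the values in `A` and `sparse_var` are invented for illustration, not taken from the original post). When a variable has a single positive non-zero value at sample i, its Pearson correlation with A reduces to a scaled z-score of A at that sample, regardless of how large the non-zero value is. The test is then effectively asking whether that one sample is an outlier in A.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: A is dense; sparse_var is zero except at the last sample.
A = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 6.0]
sparse_var = [0.0] * 9 + [3.5]

r = pearson(A, sparse_var)

# Closed form for a single positive non-zero entry at index i:
#   r = (A[i] - mean(A)) / sqrt((1 - 1/n) * sum((A - mean(A))^2))
n = len(A)
mean_A = sum(A) / n
ss_A = sum((a - mean_A) ** 2 for a in A)
r_closed = (A[-1] - mean_A) / math.sqrt((1 - 1 / n) * ss_A)

# The size of the non-zero value does not matter, only where it sits.
r_small = pearson(A, [0.0] * 9 + [0.001])

print(round(r, 4), round(r_closed, 4), round(r_small, 4))
```

If the single non-zero value happens to fall on a sample where A is extreme, the correlation will look very strong even though it rests on one observation.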

Distance metrics are nice for measuring associations. If you look at the formula for R^2, it is actually a standardized version of the MSE (R^2 = 1 - MSE / Var(y) for a fitted regression), so I might suggest trying out different distance metrics.
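The link between R^2 and MSE can be checked numerically. The sketch below uses made-up data and hypothetical helper names (`mean`, `pearson`, `zscore`) and verifies two identities: for a least-squares line, 1 - MSE / Var(y) equals the squared Pearson correlation, and for z-scored variables the mean squared difference equals 2 * (1 - r). The MSE in the first identity is that of the regression residuals, not the raw mean squared difference between the two variables.

```python
import math

def mean(v):
    return sum(v) / len(v)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def zscore(v):
    m = mean(v)
    sd = math.sqrt(mean([(a - m) ** 2 for a in v]))  # population SD
    return [(a - m) / sd for a in v]

# Hypothetical data: x is sparse, y is dense.
x = [0.0, 0.0, 1.0, 0.0, 3.0, 0.0, 2.0, 0.0]
y = [1.2, 0.8, 1.9, 1.1, 3.7, 0.9, 2.6, 1.0]

r = pearson(x, y)

# Identity 1: R^2 = 1 - MSE / Var(y) for the least-squares line y ~ x.
mx, my = mean(x), mean(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx
mse = mean([(b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)])
var_y = mean([(b - my) ** 2 for b in y])
r2_from_mse = 1 - mse / var_y

# Identity 2: mean squared difference of z-scores = 2 * (1 - r).
zx, zy = zscore(x), zscore(y)
msd = mean([(a - b) ** 2 for a, b in zip(zx, zy)])
r_from_msd = 1 - msd / 2

print(round(r ** 2, 6), round(r2_from_mse, 6))
print(round(r, 6), round(r_from_msd, 6))
```

Because of the second identity, a squared-distance metric on standardized variables ranks pairs the same way Pearson correlation does, so a different distance or a different standardization is needed to get a genuinely different answer.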