Question: How to get the correlation between the counts for each gene at the same timepoint (two replicates)
Lila M 830 wrote:

Hi everybody, I have the counts (obtained with HTSeq) for a large number of genes (~58,000) at different time points, with replicates.

```
gene                           t1_S1    t1_S2
ENSG00000000003.14              0        0
ENSG00000000005.5               0        0
ENSG00000000419.12              1        3
[...]
```

I would like to calculate the correlation between the counts for each gene at the same timepoint, to understand how reproducible the replication timing and progression are across repeats. Any suggestions?

modified 20 months ago by Nicolas Rosewick 9.0k • written 20 months ago by Lila M 830

Check out the `cor` function in R. Different kinds of correlation measures are available, including Spearman and Pearson.


This is what I am doing, but because I have a huge number of genes, R gets stuck. This is what I'm trying:

```
xx <- read.table(file="matrix_count", sep="\t", header = T)
cor(t(xx), method="pearson")
```

Any other suggestions?


Do I understand correctly that you aim to calculate 58000 correlation coefficients?

Nicolas Rosewick 9.0k wrote:

Do you want to test the correlation between the different timepoints or between the different genes?

Let's say you have 10 timepoints and 58,000 genes.

To test the different timepoints:

```
cor(xx, method="pearson")
```

will give you a 10x10 matrix, so 100 correlation values (although I guess the `cor` function is smart and does not compute the correlation between col A and col B and then again between col B and col A, so only 45 distinct pairwise correlations need to be computed).
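As a small illustration of the sample-level correlation, here is a minimal sketch on made-up counts. It assumes the gene IDs sit in the first column of the file, so they are moved to row names to keep the matrix numeric; the object names (`txt`, `counts`, `sample_cor`) and all values are hypothetical.

```r
# Toy count table (hypothetical values); gene IDs become row names so that
# the matrix passed to cor() is purely numeric.
txt <- "gene\tt1_S1\tt1_S2\tt2_S1
geneA\t0\t0\t1
geneB\t10\t12\t30
geneC\t5\t4\t9
geneD\t100\t90\t40"
counts <- as.matrix(read.table(text = txt, sep = "\t", header = TRUE, row.names = 1))

# Columns are samples, so this returns a samples x samples matrix (3 x 3 here).
sample_cor <- cor(counts, method = "pearson")
dim(sample_cor)

# Number of distinct sample pairs for 10 timepoints.
n_pairs <- choose(10, 2)
```

With the real file, `read.table(file = "matrix_count", sep = "\t", header = TRUE, row.names = 1)` would presumably play the role of the `text =` call above.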

To test the different genes (in a pairwise manner):

```
cor(t(xx), method="pearson")
```

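Here you get a 58,000 x 58,000 matrix, i.e. 3.364e+09 correlation values (or 1,681,971,000 distinct pairs if the `cor` function is smart). That's why R gets stuck: computing that many correlations takes too long, and the result matrix alone would need roughly 27 GB of memory if stored as doubles.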
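The arithmetic behind those numbers can be checked directly in R. This is only a back-of-the-envelope sketch; the variable names are made up, and the memory figure assumes the full matrix is stored as 8-byte doubles.

```r
n_genes <- 58000

# Every cell of the full gene x gene matrix.
n_cells <- n_genes^2             # 3.364e+09

# Distinct gene pairs, excluding the diagonal.
n_pairs <- choose(n_genes, 2)    # 1681971000

# Memory for the full matrix of 8-byte doubles, in gigabytes (about 26.9).
mem_gb <- n_cells * 8 / 1e9
```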

To measure how much each gene varies between replicates, use the coefficient of variation (https://en.wikipedia.org/wiki/Coefficient_of_variation):

```
dat.coeff.var <- apply(dat,1,function(x){sd(x)/mean(x)})
```
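One practical detail: genes with zero counts in every replicate produce `0/0`, so their coefficient of variation is `NaN`. A minimal sketch on made-up counts, assuming `dat` is a numeric matrix with genes in rows and replicates in columns (the values and the name `cv_expressed` are hypothetical):

```r
# Toy counts for two replicates (hypothetical values).
dat <- matrix(c(0, 0,
                1, 3,
                50, 50,
                200, 180),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                              c("t1_S1", "t1_S2")))

# Coefficient of variation per gene (row): sd / mean.
cv <- apply(dat, 1, function(x) sd(x) / mean(x))

# Genes with zero counts in every replicate give 0/0 = NaN; drop them here.
cv_expressed <- cv[rowSums(dat) > 0]
```

Whether to drop such genes or keep them as missing values is a judgment call that depends on the downstream analysis.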

Maybe I explained badly what I want. I want to know the correlation for, let's say, gene ENSG00000000003.14 in the two replicates, to see whether each gene differs between replicates. I'm not interested in the correlation between ENSG00000000003.14 and ENSG00000000005.5. Does that make more sense?


OK, so you want to check the correlation between replicates; then use `cor(xx,method="pearson")`.

Not exactly, because that gives me the correlation between replicates, and what I want to know is whether the counts for gene ENSG00000000003.14 differ between t1_S1 and t1_S2 (and likewise for the other genes).


Maybe use the coefficient of variation (https://en.wikipedia.org/wiki/Coefficient_of_variation): `dat.coeff.var <- apply(dat,1,function(x){sd(x)/mean(x)})`


That's exactly what I want! Thanks!

OK, great. I modified my answer to record the right solution. If the answer suits you, you can accept it.


There is no correlation for a single pair of measures. The correlation between samples will give you a general view of how similar samples are, and you can plot the values to check outliers. However, you have to take into account sample sequencing depth.
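One way to account for sequencing depth before comparing replicates is to scale each sample to counts per million (CPM) and correlate on a log scale. The sketch below uses made-up counts and plain CPM as a simple stand-in; dedicated packages such as edgeR or DESeq2 offer more robust normalization. All object names and values are hypothetical.

```r
# Toy counts (hypothetical values); t1_S2 is sequenced about twice as deeply.
counts <- matrix(c(0, 0,
                   1, 3,
                   50, 110,
                   200, 390),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("t1_S1", "t1_S2")))

# Counts per million: divide each column by its library size.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Correlate replicates on a log scale so a few highly expressed genes
# do not dominate the coefficient.
log_cpm <- log2(cpm + 1)
rep_cor <- cor(log_cpm[, "t1_S1"], log_cpm[, "t1_S2"], method = "pearson")

# Scatter plot to spot outlier genes.
plot(log_cpm[, "t1_S1"], log_cpm[, "t1_S2"], xlab = "t1_S1", ylab = "t1_S2")
```

Points far from the diagonal in the scatter plot are the genes whose normalized counts disagree most between the two replicates.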