I also made the same post in the Bioconductor Forum.
From our experiments in cell lines we identified genes and exons differentially regulated upon knockdown of a splicing factor using DESeq2 and MAJIQ. Now we want to transfer these findings to TCGA patients with hepatocellular carcinoma (TCGA-LIHC) using the recount3 package. The aim is to (1st task) correlate the expression of the splicing factor with the expression of the differentially regulated genes and (2nd task) to correlate the expression of the splicing factor with the PSI values of the differentially regulated exons using all 371 TCGA-LIHC patients. The calculation of the PSI values works fine and is not relevant in this post. My main problem is choosing the correct normalization strategy for the gene expression estimation.
In a first attempt for the 1st task TPM values for all genes in each of the 371 LIHC patients was calculated.
library(recount) library(recount3) human_projects <- available_projects() project_info <- subset(human_projects, project == "LIHC" & file_source == "tcga") rse_recount3 <- create_rse(project_info) rse_recount3 <- rse_recount3[,rse_recount3$tcga.gdc_cases.samples.sample_type == "Primary Tumor"] #Create TPM values assay(rse_recount3, "counts") <- transform_counts(rse_recount3) TPM.m <- getTPM(rse_recount3, length_var = "score")
As a first check I made pairwise-correlations between my splicing factor and all other genes (log-transformed TPM values), followed by the generation of density distribution of all ~20k pearson correlation coefficients obtained from the pairwise-correlations. What I observed was a completely skewed distribution shifted to the right (median ~0.4 to 0.5), which indicates that my splicing factor shows a positive correlation with most of all other genes. These results looked quite suspicious. Therefore, I picked 10.000 random gene-pairs and correlated them, followed by the generation of the density distribution. And once again, although to a lesser extent, the distribution was also shifted to the right (median ~0.2 to 0.25). It seems that some confounder is affecting the correlation analysis, which can not be captured by TPM normalization.
In a second attempt, I wanted to try another normalization strategy that takes into account differences in transcriptome composition, namely the combination of TMM and cpm from the edgeR package.
y <- DGEList(counts= assays(rse_recount3)$counts, remove.zeros=T) y <- calcNormFactors(y) CPM.m <- cpm(y, log=TRUE) %>% t
Like for the first attempt I made the correlations and generated the density distributions. Now both distributions (my splicing factor against all other genes and the 10k random pairs) are centered at around 0 and follow a normal distribution.
I also checked the correlation between my splicing factor and one gene of interest and observed a tremendous difference between the TPM and CPM normalization. For TPM I observed a quite strong positive correlation of 0.42, while for CPM I observed a weak negative correlation of -0.23.
So, here are my questions:
1) Which normalization strategy should I use? As far as I understand it TPM is good for intra-sample comparisons among genes but should not be used for inter-sample comparisons. In addition, the density distributions for the TPM approach were strongly skewed and shifted to the right, wich give me the feeling that I should go for the CPM normalization.
2) The TMM+CPM approach does not account for gene lengths. However, I think this should not be a problem for my correlation analyses. But what I could do is using the rpkm() method provided by edgeR, which converts my logCPM values to RPKM. Would this make sense?
Thanks in advace.