Single cell RNA-seq guided correlation analysis
3.1 years ago
jevanveen ▴ 20

I would very much appreciate help with what is probably a pretty basic question.

First, the question stated as succinctly as I can:

Using an a priori list of genes, what is the best way to:

1. Test whether their expression levels correlate with one another across different RNA-seq samples?
2. Rank all other genes from most correlated with my a priori list to most anti-correlated with it?
3. Get statistics and publishable data from 1 and 2?

Okay, so background for more context:

I've done RNA sequencing (bulk, not single cell) on mouse tumors induced by different oncogenic alleles. Using a standard Salmon/DESeq2/GSEA pipeline, I find a convincing-looking effect on the differentiation status of my tumors: one of my genotypes causes a very widespread loss of markers of terminal differentiation. I got this list of markers from a single-cell RNA-sequencing paper that enumerated 30 cell type markers for each of the major cell types surrounding my tumors. FWIW, I was able to validate these results using antibody staining. Let's call this list CellSig30.

So, the natural question in my head was what could be happening mechanistically to cause this widespread phenotype. I did some candidate-gene work, but I also started to look at how the expression of the genes in CellSig30 correlated with each other across all my samples, as well as how CellSig30 correlated or anti-correlated with all other genes in my expression set, in an effort to find causative pathways.

The answer looked pretty cool: 28/30 genes in CellSig30 correlated extremely well with each other, and 2 were anti-correlated to some extent. More interesting, to me, was the list of correlated and anti-correlated genes when looking at all genes in my expression set. The top hits in that list made a lot of sense. In an effort to not do too much list-gazing or candidate-gene-based stuff, I ranked the new gene list by correlation to CellSig30 and supplied that as a .rnk file to GSEA pre-ranked (using fgsea). The results look very interesting, and I'm following them up now.
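For readers who want to see the shape of this analysis outside a spreadsheet, here is a minimal sketch in plain Python. It is not the exact procedure used above: the post does not say how "correlation to CellSig30" was summarised, so the sketch assumes a per-sample signature score defined as the mean z-score of the signature genes. The gene names and expression values are made up for illustration.

```python
from statistics import mean, pstdev

def zscore(values):
    """Centre and scale one gene's expression across samples.
    Genes with zero variance must be filtered out beforehand."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def pearson(x, y):
    """Pearson correlation, computed as the mean product of z-scores."""
    return mean(a * b for a, b in zip(zscore(x), zscore(y)))

# Hypothetical toy data: gene -> normalised expression in 6 samples.
expr = {
    "SigA": [9.1, 8.7, 8.9, 4.2, 3.9, 4.5],
    "SigB": [7.8, 8.1, 7.5, 3.0, 3.3, 2.9],
    "SigC": [6.5, 6.9, 6.2, 2.8, 2.5, 3.1],
    "GeneX": [2.1, 2.4, 1.9, 6.8, 7.2, 6.5],
    "GeneY": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],
    "GeneZ": [8.0, 7.6, 8.3, 4.0, 4.4, 3.8],
}
signature = ["SigA", "SigB", "SigC"]  # stand-in for an a priori list

# Question 1: pairwise correlations within the signature.
within = {
    (a, b): pearson(expr[a], expr[b])
    for i, a in enumerate(signature)
    for b in signature[i + 1:]
}

# Assumed summary: signature score = mean z-score of signature genes per sample.
z = {g: zscore(expr[g]) for g in signature}
n_samples = len(expr[signature[0]])
score = [mean(z[g][i] for g in signature) for i in range(n_samples)]

# Question 2: rank every non-signature gene by correlation with the score.
ranked = sorted(
    ((g, pearson(v, score)) for g, v in expr.items() if g not in signature),
    key=lambda pair: pair[1],
    reverse=True,
)

# Tab-separated "gene<TAB>statistic" lines, the layout of a .rnk file.
rnk_lines = [f"{gene}\t{r:.4f}" for gene, r in ranked]
print("\n".join(rnk_lines))
```

Anti-correlated signature genes (like the 2 of 30 mentioned above) would pull a mean-z-score summary toward zero, so they may need to be dropped or sign-flipped before scoring; that choice is left open here.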

So my question is this: I did the correlation analysis in Excel (I KNOW, I'M SORRY), but I wonder if there is a tried and tested way to do this kind of thing and get statistics? I'm getting more skillful in R, so I can handle packages written there.

Thanks very much for any input you might have, even if it's just that my question is unclear, or that this kind of thing isn't routinely done.

RNA-Seq correlation basics • 2.3k views
3.1 years ago
munizmom ▴ 60

Hey, for your first question I suggest taking a look at WGCNA to study gene co-expression. For the second, you can calculate the Spearman or Pearson correlation (among others) for each gene between both conditions in R using library(psych) and, for example, the functions cor.plot.upperLowerCi and cor.ci. And if you want to rank your genes, you can calculate the distances using the get_dist function and then, depending on how you define the rank, use the distance information to order the genes.
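To make the "correlation plus statistics" idea concrete, below is a small sketch of a Spearman correlation with a permutation p-value, written in plain Python so it has no dependencies. It does not reproduce the psych functions named above; the permutation test, the helper names, and the toy numbers are all assumptions chosen for illustration. In R, `cor.test(x, y, method = "spearman")` covers similar ground.

```python
import random
from statistics import mean, pstdev

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation from population covariance and standard deviations."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |rho| >= the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps p above zero

# Hypothetical values: a per-sample signature score and one gene, 8 samples.
signature_score = [1.2, 0.9, 1.1, -0.8, -1.3, -1.1, 0.2, -0.2]
gene = [8.5, 7.9, 8.1, 3.2, 2.8, 3.0, 6.0, 5.1]

rho = spearman(signature_score, gene)
p_value = permutation_p(signature_score, gene)
print(f"rho={rho:.3f} p={p_value:.4f}")
```

When this is repeated for every gene in the expression set, the resulting p-values should be adjusted for multiple testing (for example with Benjamini-Hochberg) before any are reported.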


Thank you - I was playing with WGCNA a bit before posting this, but it seems to do whole-transcriptome clustering, whereas I want to specifically start with an a priori list that I took from an scRNA-seq paper to assess coordinate regulation of cell differentiation state. It kind of sounds like this isn't something that is done super routinely. For the second part of your answer, thank you very much as well; I will try to implement these functions to do what I'm attempting here. Cheers! Ed


Hey, I understand that you want to focus on that list of genes; nevertheless, you can use WGCNA. What I would do is carry out the WGCNA analysis with all your expression data, as recommended in the documentation, until you have calculated the correlation. Afterwards, do two analyses, even if the second one is, in my opinion, completely biased:

1. Use the x most connected genes to infer the modularity, then check how many of the genes on your list are in any module and identify which modules they are in.
2. Filter all the data to select only your list of genes of interest, do the modularity analysis, and see how they are distributed. In this case, adding other genes that you are sure should not be co-expressed with them will strengthen your results.

Anyway, I cannot recommend this second kind of analysis, as you are really biasing and limiting your result with such a strong selection. The first approach will give you enough information, and even if your genes of interest are not all in the same module, if they are in closely related modules you may be able to infer a close expression relationship from that. Also, imagine that some of the interesting genes on your list are not really differentially expressed in your data, or that the correlation level is not strong. With the second approach you are disregarding the possibility that some genes on that list are not informative in your data, and that some other novel ones may be but cannot be assessed because of the analysis you want to implement ...
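The "check which modules your genes fall into" step of the first approach can be sketched as a simple tally plus an over-representation test. This is not WGCNA itself: the module assignments below are a made-up dictionary standing in for whatever a module-detection step returns, and the hypergeometric test is one assumed way of putting a number on the overlap.

```python
from collections import Counter
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when a module of n genes is drawn from N genes,
    K of which belong to the a priori list."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical module assignment for 20 genes (gene -> module label).
module_of = {f"gene{i}": "blue" for i in range(1, 7)}
module_of.update({f"gene{i}": "brown" for i in range(7, 14)})
module_of.update({f"gene{i}": "grey" for i in range(14, 21)})

# Hypothetical a priori list: four genes land in "blue", one in "brown".
signature = ["gene1", "gene2", "gene3", "gene4", "gene9"]

counts = Counter(module_of[g] for g in signature if g in module_of)
module_sizes = Counter(module_of.values())
N = len(module_of)                           # genes that received a module
K = sum(g in module_of for g in signature)   # list genes among them

results = {}
for module, k in counts.most_common():
    results[module] = hypergeom_upper_tail(k, K, module_sizes[module], N)
    print(f"{module}: {k}/{K} list genes, module size "
          f"{module_sizes[module]}, p={results[module]:.4f}")
```

The background N matters: it should be the set of genes that actually went into the module analysis, not the whole genome, otherwise the enrichment is overstated.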


Thank you - very helpful insights


Hey, I know a lot of time has passed, but I went with your suggestion #1, it looked great, and now it is part of my published work:

https://elifesciences.org/articles/43668

so thank you very much again! Ed