I would very much appreciate help with what is probably a pretty basic question.
First, the question stated as succinctly as I can:
** Using an a priori list of genes, what is the best way to:
1. Test whether their expression levels correlate with one another across different RNA-seq samples?
2. Rank all other genes from most correlated with my a priori list to most anti-correlated with it?
3. Get stats and publishable data from 1+2 **
Okay, so background for more context:
I've done RNA sequencing (bulk, not single cell) on mouse tumors induced by different oncogenic alleles. Using a standard Salmon/DESeq2/GSEA pipeline I find a convincing-looking effect on the differentiation status of my tumors - one of my genotypes causes a very widespread loss of markers of terminal differentiation. I got this list of markers from a single-cell RNA sequencing paper that enumerated a list of 30 cell type markers for each of the major cell types surrounding my tumors. FWIW I was able to validate these results using antibody staining. Let's call this list CellSig30.
So, the natural question in my head was what could be happening mechanistically to cause this widespread phenotype. I did some candidate-gene work, but I also started to look at how the expression of the genes in CellSig30 correlated with each other across all my samples, as well as how CellSig30 correlated or anti-correlated with all other genes in my expression set, in an effort to find causative pathways.
The answer looked pretty cool - 28/30 genes in CellSig30 correlated extremely well with each other, and 2 were anti-correlated to some extent. More interesting, to me, was the list of correlated and anti-correlated genes when looking at all genes in my expression set. The top hits in that list made a lot of sense. In an effort to avoid too much list-gazing or candidate-gene-based work, I ranked the new gene list by correlation to CellSig30 and supplied that ranking to a pre-ranked GSEA (using fgsea). The results look very interesting, and I'm following them up now.
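To make the procedure concrete, here is a minimal Python sketch of the kind of calculation I mean. It is not my actual workbook: the gene names and expression values are made up, and summarizing the signature as the mean of per-gene z-scores before correlating is just one assumed way of defining "correlation to CellSig30". It also gives no p-values, which is exactly the part I'm asking about.

```python
import math

def zscore(values):
    """Center and scale one gene's expression across samples."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def signature_score(expr, signature):
    """Per-sample signature score: mean z-score over the signature genes."""
    z = [zscore(expr[g]) for g in signature]
    n_samples = len(z[0])
    return [sum(row[i] for row in z) / len(z) for i in range(n_samples)]

def rank_by_correlation(expr, signature):
    """Rank non-signature genes from most correlated to most anti-correlated."""
    score = signature_score(expr, signature)
    others = [g for g in expr if g not in signature]
    ranked = [(g, pearson(expr[g], score)) for g in others]
    return sorted(ranked, key=lambda t: t[1], reverse=True)

# Toy data (hypothetical genes): log-scale expression across 6 samples.
expr = {
    "MarkerA": [8.0, 7.5, 7.9, 3.1, 2.8, 3.3],
    "MarkerB": [6.2, 6.0, 6.5, 2.0, 2.4, 1.9],
    "GeneX": [5.0, 5.2, 4.9, 1.0, 1.3, 0.8],
    "GeneY": [1.1, 0.9, 1.4, 6.0, 6.3, 5.8],
    "GeneZ": [3.0, 3.4, 2.9, 3.3, 3.0, 3.2],
}
signature = ["MarkerA", "MarkerB"]  # stand-in for CellSig30

ranking = rank_by_correlation(expr, signature)
for gene, r in ranking:
    # Two tab-separated columns, the same layout as a .rnk file.
    print(f"{gene}\t{r:.3f}")
```

The printed gene/score pairs are the sort of ranked list I fed into the pre-ranked GSEA.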
So my question is this - I did the correlation analysis in Excel (I KNOW, I'M SORRY), but I wonder if there is a tried and tested way to do this kind of thing and get statistics? I'm getting more skillful in R, so I can handle packages written there.
Thanks very much for any input you might have, even if it's just that my question is unclear, or that this kind of thing isn't routinely done.