Question: Aggregating Gene Expression Across Rows
6.0 years ago by
United States
lkmklsmn920 wrote:

I have a gene expression matrix of RNA-seq FPKM values, where the rows are genes and the columns are samples. I have a set of genes which I will call my signature. I want a 'signature score' for each of my samples. Summing or averaging doesn't work well since highly expressed genes dominate the score. Can anybody recommend a different approach?


rnaseq gene expression • 4.4k views
ADD COMMENTlink modified 4.2 years ago by oriolebaltimore130 • written 6.0 years ago by lkmklsmn920
6.0 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

You want GSEA (Gene set enrichment analysis), which is rank based rather than raw score based, and can be used to look at how highly enriched a sample or group is for a specific gene signature. You can either do straight GSEA (you can find the software on the Broad's website) or do single sample GSEA (ssGSEA), so you can see how samples compare against each other. In fact, I'm working on something at the moment that uses that to look at the level of contamination in samples of non-targeted tissues (and this seriously screws up expression analysis, particularly in RNAseq). If you need ssGSEA, just let me know and I can post some R code to do it to save you the trouble of finding it or figuring out how to do it yourself.

Edit: Because I was asked, the original GSEA publication looked mainly at how survival in different cancer types tends to involve recruitment of the same pathways, though you wouldn't find this by looking at individual genes. The general idea behind the method is to see if genes in a given predefined set are up/down-regulated in relationship to something of interest. It seems that this is mainly done in cancer, looking at survivability or cell-type of origin (if you have only a few cell-types of interest and do get a list of DE genes between them, then you would expect a cancer originating from cell-type A to show a more enriched signature for that than a cancer arising from cell-type B). My own use of this is focused more on looking at how a given sample might actually be a mixture of multiple sources (one can use a "signature" of one of the sources to gauge how heavily it's present) and how that fact, when not accounted for, can lead to incorrect DE results (and then how one might correct for this and screen for it ahead of time). At the end of the day, this is quite similar to cancer papers looking at tumour sample purity, like this one.
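
To make the "rank based rather than raw score based" point concrete, here is a minimal Python sketch of a single-sample running-sum score. It is only in the spirit of ssGSEA, not the published implementation and not the R code offered above. The function name, the `alpha=0.25` rank weighting and the toy gene names are all assumptions made for illustration.

```python
def enrichment_score(expression, signature, alpha=0.25):
    """Simplified single-sample, rank-based enrichment score (illustrative).

    expression: dict mapping gene name -> expression value for ONE sample.
    signature:  set of gene names.
    alpha:      exponent applied to the rank weights (an assumed default).
    """
    # Order genes from highest to lowest expression; only the order is used.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    # The top gene gets rank n, the bottom gene gets rank 1.
    weights = {g: (n - i) ** alpha for i, g in enumerate(ordered)}

    hits = [g for g in ordered if g in signature]
    n_miss = n - len(hits)
    if not hits or n_miss == 0:
        raise ValueError("signature must overlap, but not cover, the gene list")
    hit_total = sum(weights[g] for g in hits)

    running_hit = 0.0
    running_miss = 0.0
    score = 0.0
    for g in ordered:
        if g in signature:
            running_hit += weights[g] / hit_total
        else:
            running_miss += 1.0 / n_miss
        # Accumulate the gap between the two cumulative distributions.
        score += running_hit - running_miss
    return score


# Toy samples with hypothetical gene names: the signature is at the top of
# the ranking in one sample and at the bottom in the other.
signature = {"A", "B"}
sample_high = {"A": 9.0, "B": 8.0, "C": 3.0, "D": 2.0, "E": 1.0}
sample_low = {"A": 1.0, "B": 2.0, "C": 9.0, "D": 8.0, "E": 3.0}
print(enrichment_score(sample_high, signature))
print(enrichment_score(sample_low, signature))
```

Because only the ordering of genes within a sample enters the score, making one signature gene a thousand times more abundant changes nothing as long as its rank stays the same, which is what stops highly expressed genes from dominating.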

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Devon Ryan94k

Just because I'm curious, could you elaborate a bit on this, or maybe post some references?

ADD REPLYlink written 6.0 years ago by David Westergaard1.4k

I updated with a bit more and a link to the original GSEA paper and another one that also uses ssGSEA. It seems that this is big in oncology, which isn't what I work on so I don't have a bunch of references at hand.

ADD REPLYlink written 6.0 years ago by Devon Ryan94k

Hi Devon,

Just so I understand this correctly, I want to confirm: ssGSEA will help calculate enrichment scores for each sample for a given gene set?

Would it be possible for you to share the link for the R code you mentioned in your answer for ssGSEA?


ADD REPLYlink modified 3 months ago by RamRS26k • written 5.5 years ago by aditi.qamra260

Awesome Ryan, I have recently been doing the "look at the level of contamination in samples of non-targeted tissues" analysis, and I want to ask whether it is possible to remove this contamination. I only know that we can calculate a gene set score for each patient, and I have no idea what to do next. Can you give me some keywords? Do you think the MMAD method can be applied here? Thank you!

ADD REPLYlink written 3.1 years ago by ytian0

I haven't a clue what MMAD is and a quick google also wasn't helpful. It's sometimes possible to separate your signal into its "proper source" and "contaminant" components, but it's not always easy. In general, this works better for microarrays. Examples of this would be independent component analysis or any other "signal source separation" methods. There are newer methods that allegedly work better with NGS-style data, but I've generally been unimpressed (I'm not up on the most recent literature though).

ADD REPLYlink written 3.1 years ago by Devon Ryan94k

Sorry. The MMAD method I mentioned is listed below, but they give few details about the input data format and few instructions.

ADD REPLYlink written 3.1 years ago by ytian0

What's the primary difference between GSEA and ssGSEA? I'm a bit confused as to what these do and can't seem to find a good resource explaining them. e.g., what exactly is an enrichment score?

ADD REPLYlink written 2.7 years ago by jlkravitz0
6.0 years ago by
United States
seidel7.0k wrote:

I think you have to define what, exactly, the score denotes. Is it an absolute value of some kind? Or is it a value that is relative to something else? One of the most straightforward things that comes to mind is to put your data in log space, to take care of the problem of data spanning several orders of magnitude, and consider that your set of genes represents a vector of values in each data set. Thus you can measure your "signature" in each of your samples by calculating the distance for this vector between any sample and a control sample. Or calculate the distance between all samples. Or form an "average" sample using the mean or median of expression values across your data set, and calculate the distance score between this "average data set" and each of your samples. All of this results in a value which is not absolute, but is based on properties of your data set. You could even choose some other set of genes within each data set, and create a score for each data set based on your signature set and some non-signature set not expected to change across conditions (i.e. some set of generally invariant genes). One can think of a million permutations of measured values that would lead to some kind of score. I haven't mentioned what distance measure you would use, but a Euclidean distance is not bad to start with. A key issue is the relative score/absolute score aspect, so you would need to think hard and perhaps provide more information on what you would like to do with this score, or what it is actually supposed to represent.
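
One of the variants described above (log space, a median "average" sample, Euclidean distance over the signature genes) could be sketched in Python roughly as follows. The function name, the dict-of-lists layout and the pseudocount of 1 are assumptions made for the example, not something prescribed in this thread.

```python
import math
from statistics import median


def signature_distances(matrix, signature, pseudocount=1.0):
    """Distance of each sample from a median 'average' sample.

    matrix:      dict mapping gene name -> list of FPKM values, one per
                 sample, with every list in the same sample order.
    signature:   set of gene names; genes outside it are ignored.
    pseudocount: added before the log so that zero FPKM is defined (assumed).
    """
    genes = [g for g in matrix if g in signature]
    if not genes:
        raise ValueError("no signature genes found in the matrix")

    # Move to log space so that genes spanning orders of magnitude
    # contribute on a comparable scale.
    logged = {g: [math.log2(v + pseudocount) for v in matrix[g]] for g in genes}

    # The reference "average" sample is the per-gene median across samples.
    reference = {g: median(logged[g]) for g in genes}

    n_samples = len(logged[genes[0]])
    return [
        math.sqrt(sum((logged[g][j] - reference[g]) ** 2 for g in genes))
        for j in range(n_samples)
    ]


# Toy matrix with hypothetical genes; G3 is not in the signature.
toy = {"G1": [1.0, 3.0, 7.0], "G2": [0.0, 1.0, 3.0], "G3": [900.0, 5.0, 2.0]}
print(signature_distances(toy, {"G1", "G2"}))
```

As noted above, the resulting number is relative: it depends on which samples are in the data set, because they define the reference.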

Another simple option to try is the geneSetTest() function, which offers a p-value for any particular gene set in any set of samples. See the associated references, e.g.

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by seidel7.0k
4.2 years ago by
United States
oriolebaltimore130 wrote:


How can I do single sample GSEA (ssGSEA) on RNA-Seq data from TCGA? We have level 3 data in TCGA with genes, raw counts, and FPKM values for each sample.

How can I convert this data into a GCT file and then use GenePattern to do ssGSEA?

Any suggestions?
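
For the GCT part, a minimal Python sketch is shown here, assuming the expression values are already loaded per gene. A GCT (version 1.2) file is tab-delimited text: a `#1.2` line, a line with the number of genes and the number of samples, a header line starting with `Name` and `Description`, then one row per gene. The function name, the sample labels and the `NA` description placeholder are assumptions for illustration.

```python
def format_gct(sample_names, rows):
    """Return the text of a GCT v1.2 file.

    sample_names: list of sample identifiers (column headers).
    rows:         list of (gene_name, description, values) tuples, where
                  values holds one number per sample.
    """
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["Name", "Description", *sample_names]),
    ]
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


# Hypothetical sample labels and FPKM values.
gct_text = format_gct(
    ["sample_1", "sample_2"],
    [("TP53", "NA", [12.5, 8.0]), ("GAPDH", "NA", [950.0, 1020.5])],
)
print(gct_text)
```

The returned string can be written to a file with a `.gct` extension for upload.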


ADD COMMENTlink modified 3 months ago by RamRS26k • written 4.2 years ago by oriolebaltimore130
6.0 years ago by
Sukhdeep Singh10.0k
Sukhdeep Singh10.0k wrote:

Well, if it's just about the score, you can make groups of low, medium and high expressed genes and self-normalize them according to the highest score in that group (dividing by the highest value). This way you generate a score matrix where all three groups have a max value of 1, and when you seed these values back into your original matrix, you won't see the bias. There might be other, statistically stronger ways to do that.
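
A rough Python sketch of that grouping idea, under some assumptions the description leaves open: genes are ordered by their mean expression across samples, cut into three roughly equal-sized groups, and every value is divided by the largest value found in its group. The function name and the toy values are hypothetical.

```python
from statistics import mean


def groupwise_max_normalize(matrix, n_groups=3):
    """Scale each expression group so that its largest value becomes 1.

    matrix:   dict mapping gene name -> list of expression values per sample.
    n_groups: number of expression-level groups (low/medium/high by default).
    """
    if not matrix:
        return {}
    # Order genes from lowest to highest mean expression.
    ordered = sorted(matrix, key=lambda g: mean(matrix[g]))
    size = -(-len(ordered) // n_groups)  # ceiling division

    normalized = {}
    for start in range(0, len(ordered), size):
        group = ordered[start:start + size]
        group_max = max(max(matrix[g]) for g in group)
        for g in group:
            # Guard against a group in which every value is zero.
            normalized[g] = [
                v / group_max if group_max > 0 else 0.0 for v in matrix[g]
            ]
    return normalized


# Toy matrix with hypothetical genes at three expression levels.
toy_matrix = {
    "G1": [0.1, 0.2], "G2": [0.4, 0.3],
    "G3": [5.0, 10.0], "G4": [8.0, 4.0],
    "G5": [500.0, 1000.0], "G6": [200.0, 100.0],
}
scaled = groupwise_max_normalize(toy_matrix)
print(scaled)
```

After scaling, summing or averaging the signature genes per sample no longer lets the high-expression group swamp the others, although genes within a group still differ in scale.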

ADD COMMENTlink written 6.0 years ago by Sukhdeep Singh10.0k