Question

Aggregating Gene Expression Across Rows

1

11.3 years ago

lkmklsmn ▴ 980

Hi,
I have a gene expression matrix of RNA-seq FPKM values, where rows are genes and columns are samples. I have a set of genes that I will call my signature, and I want a 'signature score' for each of my samples. Summing or averaging doesn't work well, since highly expressed genes dominate the score. Can anybody recommend a different approach?

Thanks

rnaseq gene expression • 6.8k views

ADD COMMENT • link updated 9.5 years ago by oriolebaltimore ▴ 190 • written 11.3 years ago by lkmklsmn ▴ 980

Ram · Answer 1 · 2014-03-20

4

11.3 years ago

Devon Ryan 105k

You want GSEA (gene set enrichment analysis), which is rank-based rather than raw-score-based and can be used to look at how highly enriched a sample or group is for a specific gene signature. You can either do straight GSEA (you can find the software on the Broad's website) or do single-sample GSEA (ssGSEA), so you can see how samples compare against each other. In fact, I'm working on something at the moment that uses that to look at the level of contamination in samples by non-targeted tissues (this seriously screws up expression analysis, particularly in RNA-seq). If you need ssGSEA, just let me know and I can post some R code to do it, to save you the trouble of finding it or figuring out how to do it yourself.

Edit: Because I was asked, the original GSEA publication looked mainly at how survival in different cancer types tends to involve recruitment of the same pathways, though you wouldn't find this by looking at individual genes. The general idea behind the method is to see if genes in a given predefined set are up/down-regulated in relationship to something of interest. It seems that this is mainly done in cancer, looking at survivability or cell-type of origin (if you have only a few cell-types of interest and do get a list of DE genes between them, then you would expect a cancer originating from cell-type A to show a more enriched signature for that than a cancer arising from cell-type B). My own use of this is focused more on looking at how a given sample might actually be a mixture of multiple sources (one can use a "signature" of one of the sources to gauge how heavily it's present) and how that fact, when not accounted for, can lead to incorrect DE results (and then how one might correct for this and screen for it ahead of time). At the end of the day, this is quite similar to cancer papers looking at tumour sample purity, like this one.
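
To make the rank-based idea concrete, here is a minimal Python sketch of an unweighted, KS-style running-sum score computed separately for each sample. It is not the Broad's GSEA software or the R code mentioned above; the published GSEA and ssGSEA methods weight each step and differ in how they summarize the walk. The function names and the toy matrix are made up for illustration.

```python
import numpy as np

def running_sum_score(expr, in_set):
    """Unweighted KS-style enrichment score for one sample (illustrative only).

    expr   : 1-D array of expression values, one per gene
    in_set : 1-D boolean array, True where the gene belongs to the signature
    """
    order = np.argsort(-expr, kind="stable")   # highest expression first
    hits = in_set[order]
    n_hit = hits.sum()
    n_miss = hits.size - n_hit
    # Step up at signature genes and down at all other genes; the walk ends at 0.
    steps = np.where(hits, 1.0 / n_hit, -1.0 / n_miss)
    walk = np.cumsum(steps)
    # Report the largest deviation from zero, keeping its sign.
    return walk[np.argmax(np.abs(walk))]

def score_samples(matrix, in_set):
    """Apply the score to every column (sample) of a genes x samples matrix."""
    return np.array([running_sum_score(matrix[:, j], in_set)
                     for j in range(matrix.shape[1])])

# Toy data: 6 genes x 2 samples; genes 0 and 1 form the hypothetical signature.
toy = np.array([[100.0,  1.0],
                [ 90.0,  2.0],
                [  5.0, 50.0],
                [  4.0, 60.0],
                [  3.0, 70.0],
                [  2.0, 80.0]])
signature = np.array([True, True, False, False, False, False])
scores = score_samples(toy, signature)
```

Because only the within-sample ordering of genes is used, multiplying a sample's values by a constant leaves its score unchanged, which is the property that keeps a few highly expressed genes from dominating.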

ADD COMMENT • link 11.3 years ago by Devon Ryan 105k

0

Just because I'm curious, could you elaborate a bit on this, or maybe post some references?

ADD REPLY • link 11.3 years ago by David Westergaard ★ 1.5k

0

I updated with a bit more and a link to the original GSEA paper and another one that also uses ssGSEA. It seems that this is big in oncology, which isn't what I work on so I don't have a bunch of references at hand.

ADD REPLY • link 11.3 years ago by Devon Ryan 105k

0

Hi Devon,

Just so I understood this correctly, I want to confirm - ssGSEA will help calculate enrichment scores for each sample for a given gene set?

Would it be possible for you to share the link for the R code you mentioned in your answer for ssGSEA?

Thanks!

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 10.8 years ago by aditi.qamra ▴ 270

0

Awesome Ryan, I have recently been doing the "look at the level of contamination in samples of non-targeted tissues" analysis, and I want to ask whether it is possible to remove this contamination. I only know that we can calculate a gene set score for each patient, and I have no idea what to do after that. Can you give me some keywords? Do you think the MMAD method can be applied here? Thank you!

ADD REPLY • link 8.4 years ago by ytian • 0

0

I haven't a clue what MMAD is and a quick google also wasn't helpful. It's sometimes possible to separate your signal into its "proper source" and "contaminant" components, but it's not always easy. In general, this works better for microarrays. Examples of this would be independent component analysis or any other "signal source separation" methods. There are newer methods that allegedly work better with NGS-style data, but I've generally been unimpressed (I'm not up on the most recent literature though).

ADD REPLY • link 8.4 years ago by Devon Ryan 105k

0

Sorry. The MMAD method I mentioned is listed below: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt566 But they give few details about the input data format or instructions.

ADD REPLY • link 8.4 years ago by ytian • 0

0

What's the primary difference between GSEA and ssGSEA? I'm a bit confused as to what these do and can't seem to find a good resource explaining them. e.g., what exactly is an enrichment score?

ADD REPLY • link 7.9 years ago by jlkravitz • 0

Ram · Answer 2 · 2016-01-17

2

9.5 years ago

oriolebaltimore ▴ 190

Hi:

How can I run single-sample GSEA (ssGSEA) on RNA-Seq data from TCGA? We have level 3 data in TCGA with genes, raw counts, and FPKM values for each sample.

How do I convert this data into a GCT file and then use GenePattern to run ssGSEA?

Any suggestions?

Thanks
Adrian

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by oriolebaltimore ▴ 190

score 1 · Answer 3 · 2014-03-20

I think you have to define what, exactly, the score denotes. Is it an absolute value of some kind? Or is it a value that is relative to something else?

One of the most straightforward things that comes to mind is to put your data in log space, to take care of the problem of data spanning several orders of magnitude, and consider that your set of genes represents a vector of values in each data set. Thus you can measure your "signature" in each of your samples by calculating the distance for this vector between any sample and a control sample. Or calculate the distance between all samples. Or form an "average" sample using the mean or median of expression values across your data set, and calculate the distance score between this "average data set" and each of your samples. All of this results in a value which is not absolute, but is based on properties of your data set.

You could even choose some other set of genes within each data set, and create a score for each data set based on your signature set and some non-signature set not expected to change across conditions (i.e. some set of generally invariant genes). One can think of a million permutations of measured values that would lead to some kind of score. I haven't mentioned what distance measure you would use, but a Euclidean distance is not bad to start with.

A key issue is the relative score/absolute score aspect. You would need to think heavily, and perhaps provide more information, about what you would like to do with this score or what it is actually supposed to represent.
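
As a rough illustration of the log-space distance idea, the Python sketch below measures how far each sample's signature vector lies from an "average" sample built from per-gene medians. The pseudocount of 1 (to avoid taking the log of zero), the choice of log2, the function name, and the toy numbers are all assumptions made for the example, not part of the suggestion above.

```python
import numpy as np

def signature_distance(fpkm, signature_rows, pseudocount=1.0):
    """Euclidean distance of each sample's signature vector from the median sample.

    fpkm           : genes x samples array of expression values
    signature_rows : row indices of the signature genes
    pseudocount    : assumed offset so that zero values can be log-transformed
    """
    logged = np.log2(fpkm[signature_rows, :] + pseudocount)  # signature genes x samples
    reference = np.median(logged, axis=1, keepdims=True)      # the "average" sample
    return np.sqrt(((logged - reference) ** 2).sum(axis=0))

# Toy data: 3 genes x 3 samples; rows 0 and 1 are the hypothetical signature.
toy_fpkm = np.array([[1000.0, 1000.0, 4000.0],
                     [   1.0,    1.0,    1.0],
                     [  10.0,   10.0,   10.0]])
distances = signature_distance(toy_fpkm, [0, 1])
```

In this toy case the first two samples match the median profile and get a distance of zero, while the third differs only by a roughly four-fold change in one gene, so its distance is about 2 on the log2 scale. As noted above, the result is relative to the data set rather than absolute.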

Another simple option to try is the geneSetTest() function from the limma R package, which offers a p-value for any particular gene set in any set of samples. See the associated references, e.g. http://www.ncbi.nlm.nih.gov/pubmed/22638577

score 0 · Answer 4 · 2014-03-20

Well, if it's just about the score, you can make groups of low, medium and high expressed genes and self-normalize them according to the highest score in that group (dividing by the highest value). This way you generate a score matrix where all three groups have a max value of 1, and when you seed these values back into your original matrix, you won't see the bias. There might be other, statistically stronger ways to do that.
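
One possible reading of this grouping idea, sketched in Python: genes are split into three equally sized groups by their mean expression across samples, and every value is divided by the largest value in its group. How the groups are formed is not specified above, so the use of mean expression and equal-sized groups is an assumption, as are the function name and toy matrix. The sketch also assumes positive values and at least as many genes as groups.

```python
import numpy as np

def scale_by_expression_group(matrix, n_groups=3):
    """Divide each value by the maximum of its gene's expression group.

    matrix   : genes x samples array of positive expression values
    n_groups : number of equally sized groups (low ... high); an assumption
    """
    gene_means = matrix.mean(axis=1)
    # Rank genes by mean expression (0 = lowest) and cut the ranks into groups.
    ranks = np.argsort(np.argsort(gene_means, kind="stable"), kind="stable")
    group = ranks * n_groups // matrix.shape[0]
    scaled = np.empty_like(matrix, dtype=float)
    for g in range(n_groups):
        rows = group == g
        scaled[rows] = matrix[rows] / matrix[rows].max()
    return scaled

# Toy data: 6 genes x 2 samples spanning three orders of magnitude.
toy_matrix = np.array([[   1.0,    2.0],
                       [   2.0,    4.0],
                       [  10.0,   20.0],
                       [  20.0,   40.0],
                       [1000.0,  500.0],
                       [2000.0, 1000.0]])
scaled = scale_by_expression_group(toy_matrix)
```

After scaling, every group has a maximum of 1, so a per-sample sum or mean over the signature rows of `scaled` is no longer driven by the most highly expressed genes.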