Question

Creating the gene sets for GSEA in R

0

Entering edit mode

8.2 years ago

arronar ▴ 290

Hi.

After statistical analysis (t-test/anova) on microarray data for differential gene expression with respect to a control, I finally have a list with the following information. Gene name , p-value, log fold change.

What i want to do now is to perform a Gene Set Expression Analysis with R. A book that I'm reading says that the first step is to create these gene sets by using Gene Ontology, KEGG or other databases and then run statistics (MLP/KS) to figure out which of these sets are enriched.

My question is how to do this very first step of creating the Gene Sets by having the data I previously mentioned (Gene name , p-value, log fold change). Is there any package function in R that is capable of doing such set unification ?

Thank you.

R microarray GO gene set • 9.0k views

ADD COMMENT • link updated 8.2 years ago by mforde84 ★ 1.4k • written 8.2 years ago by arronar ▴ 290

0

Entering edit mode

Typically a GSEA requires a 'background' gene set of all those expressed in the tissue/cells, and a 'differential' gene set, i.e. those results from your analysis (although ANOVA/t-test for DE sounds sketchy, have you tried limma?). You can select a cut-off p-value and log fold change (typically 0.01, 2 are used respectively). There are plenty of packages for GSEA on bioconductor.

Hope that helps,

Bruce.

ADD REPLY • link 8.2 years ago by bruce.moran ▴ 970

2

Entering edit mode

GSEA does not need a background set, it just need all genes analyzed, and some statistic associated with each gene - typically, log(fold-change).

ADD REPLY • link 8.2 years ago by h.mon 35k

0

Entering edit mode

True, although I still think background sets should be used to limit the sets defining biological processes. If geneX is part of processY, but it isn't found expressed, shouldn't the method be aware of this?

ADD REPLY • link 8.2 years ago by bruce.moran ▴ 970

0

Entering edit mode

Thank you for your reply.

Why sounds sketchy? I run dunnet's test at anova step and then run an FDR correction in my p-values. I have also another list that p-values have been taken from a t-test , also with FDR adjustments. (I couldn't use limma for DE because i didn't have the CEL files)

Anyway.

Here's what is written in book:

Now the idea is to examine the set of p-values {pi : i ∈ G} associated with a particular gene set GS to see whether they are, in general, smaller in magnitude than the overall set of p-values (i.e., the set of p-values for all the genes in G). This involves use of (i) a test statistic to quantify the extent of the difference between the p-values in GS and the p-values in G and (ii) a significance test to judge whether the difference is possibly real or attributable to chance. This process can be repeated for all gene sets of interest.

By reading the phrase "examine the set of p-values {pi : i ∈ G} associated with a particular gene set GS to see whether they are, in general smaller in magnitude than the overall set of p-values " i understand that firstly i have to somehow create the gene set.

Let's say that I found only two gene sets in my listed genes, and each one plays a role in a different biological procedure. The next step is to statistically compare these two gene sets against the whole initial list and see determine if any of these two gene sets is enriched in the treatments.

Is this though right or not ? If it is, then how can someone create these first gene sets in R ? If I'm wrong please let me know and if it's easy, post here some additional resources that might help on this subject.

ADD REPLY • link 8.2 years ago by arronar ▴ 290

1

Entering edit mode

See this response from a leader in the field about t-test/ANOVA/limma issue. 7 years ago but still relevant.

What book is it that you have? I suggest using online resources, and looking on Bioconductor for an appropriate method that makes sense to you.

ADD REPLY • link 8.2 years ago by bruce.moran ▴ 970

0

Entering edit mode

Thank you. Although I tried to use limma, couldn't make it work because i couldn't create proper expression set objects from my data for some reason.

ADD REPLY • link 8.2 years ago by arronar ▴ 290

score 4 · Answer 1 · 2017-04-24

4

Entering edit mode

8.2 years ago

h.mon 35k

Assuming you performed your gene-by-gene analysis with limma (I hope you did), you can follow up with CAMERA to perform enrichment tests. There are some human and mouse gene sets ready to use here. Another package is GAGE, it has functions for creating GO and KEGG gene sets and for performing the tests. Read its documentation, in particular this pdf.

P. S.: I just saw you did not use limma. You would get more helpful answers if you provided more information about your analysis. For example, which organism, and what kind of data you started from. You do not necessarily need CEL files to use limma.

ADD COMMENT • link 8.2 years ago by h.mon 35k

0

Entering edit mode

Here is how my initial data looks like and they are from mouse.

| Ge/treat |   Control_1   | Control_2 |  Cancer_1 | Cancer_2 | Cancer_3 | 
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1    |       2.65    |    3.01   |   2.20    |  3.65    |   4.01   |
| gene2    |       1.54    |    1.27   |   2.01    |  2.65    |   3.11   |
| gene3    |       1.34    |    1.00   |   2.50    |  1.65    |   2.01   |

After dunnet's test I have a table that looks like this (values are not correct of course)

| Ge/treat |   Control_pvalues   | Control_LFC |  Cancer_pvalues | Cancer_LFC   | 
|----------|:-------------------:|------------:|------------------:|-----------:|
| gene1    |             2.65    |      3.01   |           2.20    |  3.65      | 
| gene2    |             1.54    |      1.27   |           2.01    |  2.65      |  
| gene3    |             1.34    |      1.00   |           2.50    |  1.65      |

P.S The link you provided for the pdf is broken.

ADD REPLY • link 8.2 years ago by arronar ▴ 290

0

Entering edit mode

Your initial data is raw intensities? Normalized intensities? Processed with which software? Any quality-filtering steps applied? It should be possible to create any of the intermediary data objects and proceed the analysis using limma, but I don't know how troublesome would that be.

Thanks, I corrected the link.

ADD REPLY • link 8.2 years ago by h.mon 35k

0

Entering edit mode

they are in log2 scale and quantile normalized.

ADD REPLY • link 8.2 years ago by arronar ▴ 290

score 0 · Answer 2 · 2017-04-24

0

Entering edit mode

8.2 years ago

mforde84 ★ 1.4k

Try using EGSEA instead (http://bioconductor.org/packages/release/bioc/html/EGSEA.html).

You can generate geneset lists using: buildKEGGIdx, buildMSigDBIdx, etc.

ADD COMMENT • link 8.2 years ago by mforde84 ★ 1.4k