Question: Creating the gene sets for GSEA in R
arronar (Austria) wrote, 3.6 years ago:

Hi.

After statistical analysis (t-test/ANOVA) of microarray data for differential gene expression with respect to a control, I now have a list with the following information: gene name, p-value, and log fold change.

What I want to do now is perform a Gene Set Enrichment Analysis (GSEA) in R. A book that I'm reading says that the first step is to create the gene sets using Gene Ontology, KEGG or other databases, and then run statistics (MLP/KS) to figure out which of these sets are enriched.

My question is how to do this very first step of creating the gene sets, given the data I mentioned above (gene name, p-value, log fold change). Is there any package or function in R that can build such sets?

Thank you.

gene set go microarray R • 3.8k views
modified 3.6 years ago by mforde84 • written 3.6 years ago by arronar

Typically a GSEA requires a 'background' gene set of all genes expressed in the tissue/cells, and a 'differential' gene set, i.e. the results from your analysis (although ANOVA/t-test for DE sounds sketchy; have you tried limma?). You can select cut-offs for the p-value and the log fold change (0.01 and 2, respectively, are typical). There are plenty of packages for GSEA on Bioconductor.
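The cut-off approach described here is usually called over-representation analysis. A minimal base-R sketch follows, assuming a made-up results table `res` and a hypothetical gene set `set_genes`; treating the fold-change cut-off as an absolute log fold change of 2 is also an assumption.

```r
# Hypothetical results table: one row per analyzed gene (this is the background).
res <- data.frame(
  gene = paste0("gene", 1:200),
  p    = rep(0.5, 200),
  lfc  = rep(0.1, 200),
  stringsAsFactors = FALSE
)
# Make the first 10 genes clearly differential, purely for illustration.
res$p[1:10]   <- 1e-4
res$lfc[1:10] <- 3

background <- res$gene
de_genes   <- res$gene[res$p < 0.01 & abs(res$lfc) > 2]

# Hypothetical gene set, restricted to genes that were actually analyzed.
set_genes <- intersect(paste0("gene", c(1:8, 150:161)), background)

# Hypergeometric test: are DE genes over-represented in the set?
k <- length(intersect(de_genes, set_genes))  # DE genes inside the set
m <- length(set_genes)                       # set genes in the background
n <- length(background) - m                  # background genes outside the set
N <- length(de_genes)                        # number of DE genes drawn
p_enrich <- phyper(k - 1, m, n, N, lower.tail = FALSE)  # P(X >= k)
print(p_enrich)
```

Restricting the set to the background is what makes the test aware of genes that were never measured or expressed.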

Hope that helps,

Bruce.

written 3.6 years ago by bruce.moran

GSEA does not need a background set; it just needs all genes analyzed and some statistic associated with each gene, typically log(fold-change).
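The input described here, every analyzed gene plus one statistic, can be sketched in base R as a named numeric vector sorted by that statistic. The table `res` and its column names are made up for illustration, and the signed -log10(p) score is just one common alternative to the log fold change, not something stated in this thread.

```r
# Hypothetical per-gene results (gene name, p-value, log fold change).
res <- data.frame(
  gene = c("gene1", "gene2", "gene3", "gene4"),
  p    = c(0.001, 0.20, 0.03, 0.50),
  lfc  = c(1.8, -0.4, -2.1, 0.1),
  stringsAsFactors = FALSE
)

# Rank by log fold change: no cut-off is applied, every gene is kept.
ranked_lfc <- sort(setNames(res$lfc, res$gene), decreasing = TRUE)

# Alternative ranking: significance signed by the direction of change.
score        <- sign(res$lfc) * -log10(res$p)
ranked_score <- sort(setNames(score, res$gene), decreasing = TRUE)

print(ranked_lfc)
print(ranked_score)
```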

written 3.6 years ago by h.mon

True, although I still think background sets should be used to limit the sets defining biological processes. If geneX is part of processY, but it isn't found expressed, shouldn't the method be aware of this?

written 3.6 years ago by bruce.moran

Thank you for your reply.

Why does it sound sketchy? I ran Dunnett's test at the ANOVA step and then applied an FDR correction to my p-values. I also have another list whose p-values come from a t-test, also with FDR adjustment. (I couldn't use limma for DE because I didn't have the CEL files.)

Anyway.

Here's what is written in the book:

Now the idea is to examine the set of p-values {p_i : i ∈ G} associated with a particular gene set GS to see whether they are, in general, smaller in magnitude than the overall set of p-values (i.e., the set of p-values for all the genes in G). This involves use of (i) a test statistic to quantify the extent of the difference between the p-values in GS and the p-values in G and (ii) a significance test to judge whether the difference is possibly real or attributable to chance. This process can be repeated for all gene sets of interest.

From the phrase "examine the set of p-values {p_i : i ∈ G} associated with a particular gene set GS to see whether they are, in general, smaller in magnitude than the overall set of p-values", I understand that I first have to somehow create the gene set.

Let's say that I found only two gene sets among my listed genes, and each one plays a role in a different biological process. The next step is to statistically compare each of these two gene sets against the whole initial list and determine whether either of them is enriched in the treatments.

Is this reasoning right or not? If it is, how can one create these initial gene sets in R? If I'm wrong, please let me know and, if it's easy, post some additional resources that might help on this subject.
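The procedure quoted from the book can be sketched in base R. Everything here is illustrative: the p-values are made up, the two gene sets are hypothetical stand-ins for GO or KEGG annotation, and the one-sided Kolmogorov-Smirnov test is only one possible choice of test statistic (the KS option mentioned in the question). Each set is compared with the genes outside it rather than with all of G, so that the two samples do not overlap.

```r
genes <- paste0("gene", 1:100)

# Made-up per-gene p-values: genes 1-10 are small, the rest are spread out.
p <- setNames(c(seq(0.001, 0.010, length.out = 10),
                seq(0.05, 0.99, length.out = 90)), genes)

# A gene set is just a character vector of gene names; a collection of
# sets is a named list. "geneX" was never analyzed and is ignored below.
gene_sets <- list(
  processA = c(paste0("gene", c(1:8, 50, 60)), "geneX"),
  processB = paste0("gene", seq(15, 95, by = 10))
)

# One-sided KS test: are the p-values inside the set shifted towards
# smaller values compared with the p-values outside the set?
test_set <- function(set, p) {
  in_set <- names(p) %in% set
  ks.test(p[in_set], p[!in_set], alternative = "greater")$p.value
}

set_p   <- vapply(gene_sets, test_set, numeric(1), p = p)
set_fdr <- p.adjust(set_p, method = "BH")  # correct across gene sets
print(cbind(set_p, set_fdr))
```

In practice the named list would come from an annotation source rather than being typed by hand, but the structure stays the same.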

written 3.6 years ago by arronar

See this response from a leader in the field about the t-test/ANOVA/limma issue. It is 7 years old but still relevant.

What book is it that you have? I suggest using online resources, and looking on Bioconductor for an appropriate method that makes sense to you.

written 3.6 years ago by bruce.moran

Thank you. I tried to use limma, but I couldn't make it work because, for some reason, I couldn't create proper expression set objects from my data.

written 3.6 years ago by arronar
h.mon (Brazil) wrote, 3.6 years ago:

Assuming you performed your gene-by-gene analysis with limma (I hope you did), you can follow up with CAMERA to perform enrichment tests. There are some human and mouse gene sets ready to use here. Another package is GAGE, which has functions for creating GO and KEGG gene sets and for performing the tests. Read its documentation, in particular this pdf.
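A minimal sketch of the CAMERA route, assuming limma is installed (the limma calls are skipped otherwise). The expression matrix is simulated, the gene sets are hypothetical, and the 2-control / 3-cancer layout is only an example design. `ids2indices()` and `camera()` are limma functions; every other name is made up for illustration.

```r
set.seed(42)
genes   <- paste0("gene", 1:100)
samples <- c("Control_1", "Control_2", "Cancer_1", "Cancer_2", "Cancer_3")

# Simulated log2 expression matrix: genes in rows, samples in columns.
expr <- matrix(rnorm(100 * 5), nrow = 100, dimnames = list(genes, samples))
# Shift the first 10 genes upwards in the cancer samples, for illustration.
expr[1:10, 3:5] <- expr[1:10, 3:5] + 2

group  <- factor(c("Control", "Control", "Cancer", "Cancer", "Cancer"),
                 levels = c("Control", "Cancer"))
design <- model.matrix(~ group)  # column 2 is the Cancer vs Control effect

# Hypothetical gene sets as a named list of gene identifiers.
gene_sets <- list(processA = genes[1:10], processB = genes[41:55])

if (requireNamespace("limma", quietly = TRUE)) {
  # Convert gene names to row indices of expr, then run the competitive test.
  idx <- limma::ids2indices(gene_sets, identifiers = rownames(expr))
  cam <- limma::camera(expr, index = idx, design = design, contrast = 2)
  print(cam)
}
```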

P. S.: I just saw you did not use limma. You would get more helpful answers if you provided more information about your analysis. For example, which organism, and what kind of data you started from. You do not necessarily need CEL files to use limma.

modified 3.6 years ago • written 3.6 years ago by h.mon

Here is what my initial data look like; they are from mouse.

| Ge/treat |   Control_1   | Control_2 |  Cancer_1 | Cancer_2 | Cancer_3 | 
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1    |       2.65    |    3.01   |   2.20    |  3.65    |   4.01   |
| gene2    |       1.54    |    1.27   |   2.01    |  2.65    |   3.11   |
| gene3    |       1.34    |    1.00   |   2.50    |  1.65    |   2.01   |

After Dunnett's test I have a table that looks like this (the values are placeholders, of course):

| Ge/treat |   Control_pvalues   | Control_LFC |  Cancer_pvalues | Cancer_LFC   | 
|----------|:-------------------:|------------:|------------------:|-----------:|
| gene1    |             2.65    |      3.01   |           2.20    |  3.65      | 
| gene2    |             1.54    |      1.27   |           2.01    |  2.65      |  
| gene3    |             1.34    |      1.00   |           2.50    |  1.65      |

P.S. The link you provided for the pdf is broken.

modified 3.6 years ago • written 3.6 years ago by arronar

Are your initial data raw intensities? Normalized intensities? Processed with which software? Were any quality-filtering steps applied? It should be possible to create any of the intermediate data objects and proceed with the analysis using limma, but I don't know how troublesome that would be.
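As a rough sketch of what that could look like: limma's `lmFit()` accepts a plain numeric matrix of log-expression values, so a table of normalized values can be used without CEL files. The table below is simulated, its column names are made up, and the limma calls run only if the package is installed.

```r
set.seed(7)
samples <- c("Control_1", "Control_2", "Cancer_1", "Cancer_2", "Cancer_3")

# Hypothetical table as it might be read from a file: one gene column
# followed by one column of log2 expression values per sample.
tab <- data.frame(
  gene = paste0("gene", 1:50),
  matrix(rnorm(50 * 5, mean = 8), nrow = 50, dimnames = list(NULL, samples)),
  stringsAsFactors = FALSE
)

# Numeric matrix with gene names as row names.
expr <- as.matrix(tab[, samples])
rownames(expr) <- tab$gene

# Derive the group of each sample from its column name.
group  <- factor(sub("_[0-9]+$", "", colnames(expr)),
                 levels = c("Control", "Cancer"))
design <- model.matrix(~ group)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::eBayes(limma::lmFit(expr, design))
  top <- limma::topTable(fit, coef = "groupCancer", number = Inf)
  print(head(top))
}
```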

Thanks, I corrected the link.

written 3.6 years ago by h.mon

They are on the log2 scale and quantile-normalized.

written 3.6 years ago by arronar
mforde84 wrote, 3.6 years ago:

Try using EGSEA instead (http://bioconductor.org/packages/release/bioc/html/EGSEA.html).

You can generate geneset lists using: buildKEGGIdx, buildMSigDBIdx, etc.

written 3.6 years ago by mforde84