Question: Creating the gene sets for GSEA in R
0
gravatar for arronar
24 months ago by
arronar160
Austria
arronar160 wrote:

Hi.

After statistical analysis (t-test/anova) on microarray data for differential gene expression with respect to a control, I finally have a list with the following information. Gene name , p-value, log fold change.

What i want to do now is to perform a Gene Set Expression Analysis with R. A book that I'm reading says that the first step is to create these gene sets by using Gene Ontology, KEGG or other databases and then run statistics (MLP/KS) to figure out which of these sets are enriched.

My question is how to do this very first step of creating the Gene Sets by having the data I previously mentioned (Gene name , p-value, log fold change). Is there any package function in R that is capable of doing such set unification ?

Thank you.

gene set go microarray R • 2.3k views
ADD COMMENTlink modified 24 months ago by mforde841.2k • written 24 months ago by arronar160

Typically a GSEA requires a 'background' gene set of all those expressed in the tissue/cells, and a 'differential' gene set, i.e. those results from your analysis (although ANOVA/t-test for DE sounds sketchy, have you tried limma?). You can select a cut-off p-value and log fold change (typically 0.01, 2 are used respectively). There are plenty of packages for GSEA on bioconductor.

Hope that helps,

Bruce.

ADD REPLYlink written 24 months ago by bruce.moran600
2

GSEA does not need a background set, it just need all genes analyzed, and some statistic associated with each gene - typically, log(fold-change).

ADD REPLYlink written 24 months ago by h.mon24k

True, although I still think background sets should be used to limit the sets defining biological processes. If geneX is part of processY, but it isn't found expressed, shouldn't the method be aware of this?

ADD REPLYlink written 24 months ago by bruce.moran600

Thank you for your reply.

Why sounds sketchy? I run dunnet's test at anova step and then run an FDR correction in my p-values. I have also another list that p-values have been taken from a t-test , also with FDR adjustments. (I couldn't use limma for DE because i didn't have the CEL files)

Anyway.

Here's what is written in book:

Now the idea is to examine the set of p-values {pi : i ∈ G} associated with a particular gene set GS to see whether they are, in general, smaller in magnitude than the overall set of p-values (i.e., the set of p-values for all the genes in G). This involves use of (i) a test statistic to quantify the extent of the difference between the p-values in GS and the p-values in G and (ii) a significance test to judge whether the difference is possibly real or attributable to chance. This process can be repeated for all gene sets of interest.

By reading the phrase "examine the set of p-values {pi : i ∈ G} associated with a particular gene set GS to see whether they are, in general smaller in magnitude than the overall set of p-values " i understand that firstly i have to somehow create the gene set.

Let's say that I found only two gene sets in my listed genes, and each one plays a role in a different biological procedure. The next step is to statistically compare these two gene sets against the whole initial list and see determine if any of these two gene sets is enriched in the treatments.

Is this though right or not ? If it is, then how can someone create these first gene sets in R ? If I'm wrong please let me know and if it's easy, post here some additional resources that might help on this subject.

ADD REPLYlink written 24 months ago by arronar160
1

See this response from a leader in the field about t-test/ANOVA/limma issue. 7 years ago but still relevant.

What book is it that you have? I suggest using online resources, and looking on Bioconductor for an appropriate method that makes sense to you.

ADD REPLYlink written 24 months ago by bruce.moran600

Thank you. Although I tried to use limma, couldn't make it work because i couldn't create proper expression set objects from my data for some reason.

ADD REPLYlink written 24 months ago by arronar160
2
gravatar for h.mon
24 months ago by
h.mon24k
Brazil
h.mon24k wrote:

Assuming you performed your gene-by-gene analysis with limma (I hope you did), you can follow up with CAMERA to perform enrichment tests. There are some human and mouse gene sets ready to use here. Another package is GAGE, it has functions for creating GO and KEGG gene sets and for performing the tests. Read its documentation, in particular this pdf.

P. S.: I just saw you did not use limma. You would get more helpful answers if you provided more information about your analysis. For example, which organism, and what kind of data you started from. You do not necessarily need CEL files to use limma.

ADD COMMENTlink modified 24 months ago • written 24 months ago by h.mon24k

Here is how my initial data looks like and they are from mouse.

| Ge/treat |   Control_1   | Control_2 |  Cancer_1 | Cancer_2 | Cancer_3 | 
|----------|:-------------:|----------:|----------:|---------:|---------:|
| gene1    |       2.65    |    3.01   |   2.20    |  3.65    |   4.01   |
| gene2    |       1.54    |    1.27   |   2.01    |  2.65    |   3.11   |
| gene3    |       1.34    |    1.00   |   2.50    |  1.65    |   2.01   |

After dunnet's test I have a table that looks like this (values are not correct of course)

| Ge/treat |   Control_pvalues   | Control_LFC |  Cancer_pvalues | Cancer_LFC   | 
|----------|:-------------------:|------------:|------------------:|-----------:|
| gene1    |             2.65    |      3.01   |           2.20    |  3.65      | 
| gene2    |             1.54    |      1.27   |           2.01    |  2.65      |  
| gene3    |             1.34    |      1.00   |           2.50    |  1.65      |

P.S The link you provided for the pdf is broken.

ADD REPLYlink modified 24 months ago • written 24 months ago by arronar160

Your initial data is raw intensities? Normalized intensities? Processed with which software? Any quality-filtering steps applied? It should be possible to create any of the intermediary data objects and proceed the analysis using limma, but I don't know how troublesome would that be.

Thanks, I corrected the link.

ADD REPLYlink written 24 months ago by h.mon24k

they are in log2 scale and quantile normalized.

ADD REPLYlink written 24 months ago by arronar160
0
gravatar for mforde84
24 months ago by
mforde841.2k
mforde841.2k wrote:

Try using EGSEA instead (http://bioconductor.org/packages/release/bioc/html/EGSEA.html).

You can generate geneset lists using: buildKEGGIdx, buildMSigDBIdx, etc.

ADD COMMENTlink written 24 months ago by mforde841.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1605 users visited in the last hour