For GSEA, check the example files to get an idea of the required formatting. I recently used the Java implementation of GSEA for the first time and got it working with the three files below.
cls file
Contains information on the factor (phenotype) in our data. On the first line, 35 7 1 means, in this case, 35 samples and 7 unique levels for the listed factor (the third number is always 1). The second line names the levels after a #. On the third line of the file, we list the actual level of each sample; these should line up with the sample columns in the gct file.
NB - these are space-delimited.
35 7 1
# d0 d1 d2 d4 d6 d8 d10
d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10 d0 d1 d2 d4 d6 d8 d10
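If you would rather generate the cls file than type it by hand, something like the following should work. It is only a minimal sketch: the function name make_cls is my own, and it assumes a single categorical factor whose labels contain no spaces, listed in the same order as the gct sample columns.

```python
def make_cls(labels):
    """Return the text of a categorical .cls file for the given per-sample labels."""
    # Unique levels, kept in order of first appearance
    levels = list(dict.fromkeys(labels))
    # Line 1: number of samples, number of levels, and the constant 1
    header = f"{len(labels)} {len(levels)} 1"
    # Line 2: the level names, preceded by '#'
    names = "# " + " ".join(levels)
    # Line 3: one label per sample, space-delimited
    return "\n".join([header, names, " ".join(labels)]) + "\n"


days = ["d0", "d1", "d2", "d4", "d6", "d8", "d10"]
labels = days * 5  # 5 replicates, in the same order as the gct columns
cls_text = make_cls(labels)
print(cls_text)
```

Writing cls_text to a file with a .cls extension gives the same three-line layout as the example above.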
gct file
This contains the expression values. You need a NAME and a DESCRIPTION column before the expression values actually start. DESCRIPTION can be just na. Again, note the header information on the second line: here, 18062 genes x 35 samples.
NB - these are tab-delimited.
#1.0
18062   35
NAME    DESCRIPTION Day 0, rep 1    Day 1, rep 1    Day 2, rep 1    Day 4, rep 1    Day 6, rep 1    Day 8, rep 1    Day 10, rep 1   Day 0, rep 2    Day 1, rep 2    Day 2, rep 2    Day 4, rep 2    Day 6, rep 2    Day 8, rep 2    Day 10, rep 2   Day 0, rep 3    Day 1, rep 3    Day 2, rep 3    Day 4, rep 3    Day 6, rep 3    Day 8, rep 3    Day 10, rep 3   Day 0, rep 4    Day 1, rep 4    Day 2, rep 4    Day 4, rep 4    Day 6, rep 4    Day 8, rep 4    Day 10, rep 4   Day 0, rep 5    Day 1, rep 5    Day 2, rep 5    Day 4, rep 5    Day 6, rep 5    Day 8, rep 5    Day 10, rep 5
A1BG    na  -1.78750107249577   -1.78731965121805   -1.78739011815182   -1.78648292007421   -1.78825323052185   -1.75670265819045   -1.7856669206048    -1.78652518885366   -1.78682730267777   -1.78980334199807   -1.78644486265833   -1.7868860041479    -1.78844156465141   -1.78740712853483   -1.75644423399062   -1.78612773069836   -1.78929036918159   -1.78723396224438   -1.76697481762272   -1.78693195908128   -1.78629510548009   -1.78470994669637   -1.78615883408804   -1.75804087324122   -1.78652254894815   -1.78711039289089   -1.76833202023458   -1.78672978697874   -1.7850823437463    -1.78625577998891   -1.78670342516185   -1.78584154361388   -1.78728728194433   -1.78497558588491   -1.78644925915904
A1CF    na  1.68492754186313    1.54066315490874    1.54006231864025    1.51816007039476    1.60513517299563    1.5837019048566 1.61600434016912    1.51769932951262    1.60421752506403    1.56906960878706    1.65730147755638    1.57148034912919    1.64703379520972    1.54022084471361    1.61967950619213    1.51949572547524    1.52562157884476    1.540660774612  1.54957287190596    1.48702357593441    1.54796402052754    1.59524718481615    1.48932230313822    1.60079524224128    1.75736087058801    1.51447655944983    1.61715833564219    1.60452069557156    1.52619397748714    1.48902853362178    1.57432099780454    1.64145506694909    1.56773033915297    1.52760402017735    1.65905159731629
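For anyone building the gct file from a script, here is a rough sketch of the layout described above. The function name write_gct and the toy values are mine, not part of GSEA. The version line is a parameter that defaults to the #1.0 used in my example; the Broad file-format documentation, as far as I know, describes #1.2, so adjust it if GSEA complains.

```python
import io


def write_gct(handle, genes, samples, matrix, version="#1.0"):
    """Write a tab-delimited gct table: version line, dimensions, header, one row per gene."""
    if len(genes) != len(matrix):
        raise ValueError("one row of values is needed per gene")
    handle.write(version + "\n")
    # Second line: number of genes, then number of samples
    handle.write(f"{len(genes)}\t{len(samples)}\n")
    handle.write("\t".join(["NAME", "DESCRIPTION", *samples]) + "\n")
    for gene, row in zip(genes, matrix):
        if len(row) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(row)}")
        # DESCRIPTION is filled with 'na', as in the example above
        handle.write("\t".join([gene, "na", *(str(v) for v in row)]) + "\n")


# Toy data: values are rounded placeholders, not real results
samples = ["Day 0, rep 1", "Day 1, rep 1", "Day 2, rep 1"]
genes = ["A1BG", "A1CF"]
matrix = [[-1.79, -1.79, -1.79], [1.68, 1.54, 1.54]]

buffer = io.StringIO()
write_gct(buffer, genes, samples, matrix)
gct_text = buffer.getvalue()
print(gct_text)
```

In practice you would pass an open file handle instead of the StringIO buffer; the length checks are there because a mismatch between the header counts and the actual table is an easy mistake to make.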
gmt file
Contains the signatures:
GO_CELL_REDOX_HOMEOSTASIS   http://software.broadinstitute.org/gsea/msigdb/cards/GO_CELL_REDOX_HOMEOSTASIS.html PDIA6   TXNDC9  GLRX3   PRDX4   TXNRD2  PDIA5   EGLN2   TXNRD3  AIFM3   CYBA    CYBB    DDIT3   QSOX2   DLD PDILT   ERP44   DNAJC16 NNT TXNDC8  TXN2    GCLC    GLRX    GPX1    PDIA3   GSR ERO1L   APEX1   NME9    IL6 GRXCR1  LTF NCF2    NCF4    NFE2L2  NOS1    NOS2    NOS3    P4HB    GLRX2   TXNDC12 TXNDC11 TMX2    GLRX5   TXNDC3  DNAJC10 TMX3    SELS    TMX4    ERO1LB  TXNDC16 QSOX1   PDIA2   NCF1    SLC11A1 TXN TXNRD1  TXNDC15 PTGES2  TMX1    TXNDC5  CAMP    SH3BGRL3    TXNDC2  KRIT1   AIFM1   TXNL1   PDIA4
GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS    http://software.broadinstitute.org/gsea/msigdb/cards/GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS.html  CASP12  CEBPB   DAB2IP  DDIT3   ERN1    PPP1R15A    BBC3    GSK3B   ERO1L   UBE2K   APAF1   ITPR1   MAP3K5  ATF4    ATP2A1  PMAIP1  PML DNAJC10 TRIB3   BAK1    BAX SELK    BCL2    TMBIM6  TRAF2   XBP1    CHAC1   BAG6    CASP4   TNFRSF10B   BRSK2   AIFM1
NB - these are tab-delimited.
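To check that a gmt file really follows this layout (set name, description or URL, then the genes, all tab-delimited), a small parser can help. This is just a sketch under that assumption; parse_gmt is a name I made up, and the two sets below are truncated copies of the ones above with na as a placeholder description.

```python
def parse_gmt(lines):
    """Parse tab-delimited gmt lines into a dict of {set name: list of genes}."""
    gene_sets = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # Need at least: name, description, and one gene
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected name, description and genes separated by tabs")
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


gmt_lines = [
    "GO_CELL_REDOX_HOMEOSTASIS\tna\tPDIA6\tTXNDC9\tGLRX3",
    "GO_INTRINSIC_APOPTOTIC_SIGNALING_PATHWAY_IN_RESPONSE_TO_ENDOPLASMIC_RETICULUM_STRESS\tna\tCASP12\tCEBPB",
]
gene_sets = parse_gmt(gmt_lines)
for name, genes in gene_sets.items():
    print(name, len(genes))
```

A file that uses spaces instead of tabs ends up as a single field per line and triggers the error with the offending line number.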
----------------------------------------
Kevin
Thank you @Kevin. I am new to this kind of analysis. In total I have 8 samples (4 treated and 4 untreated) with 3 replicates.
I am now confused about which expression values I should provide in the gct file. Please help me with this.
Hey, you should go one step further and produce the rlog- or vst-transformed counts, and then use those in the gct file.
Thank you. I have used this code:
I obtained this file:
Should I use these values? Also, I don't know how to create a gmt file. Thank you
Thank you, but remember that you require this format:
You need an extra column for DESCRIPTION.
Thank you Kevin, it is working now. I have downloaded the gene set files for Arabidopsis thaliana from this website. Is it right to use them? GSEA shows an error when the GMT-formatted file for all gene sets is uploaded, but it works well when some of the individual gene sets are uploaded.
The GMT files through that link that you posted do not look correct, to be honest. Take a look at the format and compare to the one that I posted, above.
Are you using the Java version of GSEA from the command line?
I am not using the command line but the GSEA Desktop Application. The format of these files looks different from the one you posted. I do not know where I can get gene sets for Arabidopsis; I could not find Arabidopsis on gsea/msigdb. Could you please suggest a link? One other thing I would like to clarify: I am using ATH1_121501.chip as the Chip platform in the GSEA analysis. Is this the right chip to use for Arabidopsis thaliana RNA-seq data?
I think that it may involve some searching. For example, I found this, and the gene sets are in the correct format: http://www.go2msig.org/cgi-bin/prebuilt.cgi?taxid=3702
It is also easy to create custom gene sets.
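As a rough illustration of what creating custom gene sets could look like, the sketch below turns a dictionary of sets into gmt text. The function name to_gmt, the set names and the gene identifiers are all hypothetical placeholders; the identifiers you use must match the ones in the NAME column of your gct file.

```python
def to_gmt(gene_sets, description="na"):
    """Return gmt text: one tab-delimited line per set (name, description, genes)."""
    lines = []
    for name, genes in gene_sets.items():
        # Drop duplicate genes while keeping their original order
        unique = list(dict.fromkeys(genes))
        if not unique:
            raise ValueError(f"{name}: a gene set needs at least one gene")
        lines.append("\t".join([name, description, *unique]))
    return "\n".join(lines) + "\n"


# Placeholder sets and gene identifiers, for illustration only
custom_sets = {
    "MY_STRESS_RESPONSE": ["GENE_A", "GENE_B", "GENE_A", "GENE_C"],
    "MY_ROOT_DEVELOPMENT": ["GENE_D", "GENE_E"],
}
gmt_text = to_gmt(custom_sets)
print(gmt_text)
```

Saving gmt_text to a file with a .gmt extension gives one set per line, in the same layout as the example I posted.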