GSEA matrix error appears during analysis
9.4 years ago

Dear All,

I would like to use the Java GSEA software to analyse RNA-Seq FPKM datasets for enriched KEGG pathways. I prepared a .txt gene expression file that the programme can read, along with a matching .cls phenotype file. I downloaded the KEGG gene sets from http://www.broadinstitute.org/gsea/msigdb/collections.jsp

I use gene symbols in both the gene sets and the expression table. Unfortunately, the analysis fails with this error message:

Full Error Message

col:4 > matrix's fColCnt:4

---- Stack Trace ----
# of exceptions: 1
------col:4 > matrix's fColCnt:4------
java.lang.ArrayIndexOutOfBoundsException: col:4 > matrix's fColCnt:4
                at edu.mit.broad.genome.math.Matrix.getColumnV(EIKM)
                at edu.mit.broad.genome.objects.DefaultDataset.getColumn(EIKM)
                at edu.mit.broad.genome.objects.TemplateFactory.extract(EIKM)
                at edu.mit.broad.genome.alg.DatasetGenerators.extract(EIKM)
                at edu.mit.broad.genome.alg.DatasetGenerators.extract(EIKM)
                at xtools.gsea.AbstractGsea2Tool.execute_one(EIKM)
                at xtools.gsea.AbstractGsea2Tool.execute_one_with_reporting(EIKM)
                at xtools.gsea.Gsea.execute(EIKM)
                at edu.mit.broad.xbench.tui.TaskManager$ToolRunnable.run(EIKM)
                at java.lang.Thread.run(Unknown Source)

I tried several things to fix this problem. I re-ran the analysis with the number of permutations set to 1 and to 5. I also replaced every value of 0 in the expression file with 0.001, in case the zeros were the problem. None of this worked.

Your help will be deeply appreciated.

software-error RNA-Seq • 8.7k views

How many columns are there in your matrix? Also, can you show the .cls file? Does your matrix have 3 columns and your .cls file says it has 4?


Hi, thanks for the reply.

You can see both the successfully uploaded expression dataset and phenotype file below.

phen.cls

12 4 1
#UT NC A C
UT UT UT NC NC NC A A A C C C

expression.txt

NAME    DESCRIPTION    UT    NC    A    C
MARCH1    na    0.6347485    1.223443    1.189554    0.965276
SEPT1    na    3.2725335    0.9773395    1.9080605    2.511335
DEC1    na    0.250527    0.0911614    0.183621    0.156866
MARCH2    na    5.224765    4.12003    3.87765    4.29023
SEPT2    na    119.27    143.509    154.205    148.109
MARCH3    na    0.866247    0.324322    0.373725    0.141192
SEPT3    na    0.0581423    0.00590876    0.0156654    0.0249253
MARCH4    na    0.813892    1.47467    1.06931    1.89613
SEPT4    na    0.279058    0.589557    0.371078    0.423102
MARCH5    na    35.6057    39.1863    35.6942    37.1779
MARCH6    na    44.1798    47.239    47.5794    46.8037
SEPT6    na    2.19308    3.22229    4.73057    5.83277
MARCH7    na    58.873    51.4146    47.9548    45.0427
SEPT7    na    63.8154    69.0978    69.9618    65.211
MARCH8    na    2.951695    2.08923    1.938945    1.93033
SEPT8    na    16.5951    16.1855    12.8771    20.2097
MARCH9    na    8.32906    6.12506    7.19667    6.36341
SEPT9    na    34.79395    40.97635    53.7905    44.52175
MARCH10    na    0.105531    0.0756297    0.141385    0.062932
SEPT10    na    34.25834    33.3837    37.553966    32.46296
MARCH11    na    0    0    0    0

Note that some genes have an FPKM of 0; could that be causing the problem?

Thanks for the help again.


According to your .cls file, the matrix should have 12 sample columns across 4 conditions, but it has only 4 columns of expression data.


I thought the first value referred to the total number of conditions. OK, I will change it to 4 4 1 and see if it works.

Thanks

9.4 years ago
komal.rathi ★ 4.1k

Why did you put 12 in the first value? Do you have data for 12 samples or just 4 samples? If you don't have data for 12 samples, did you just randomly select 12?!

12 4 1 <- This means you have 12 samples & 4 conditions
#UT NC A C <- This tells what are the conditions
UT UT UT NC NC NC A A A C C C <- This means that you have 12 samples, 3 samples per condition, the first three being UT, the next three NC and so on.

So, you SHOULD have 12 data columns in your expression matrix.

In the phenotype file, the first value is the total number of samples, which is not the same as the total number of conditions. With replicates, the number of conditions is always less than the number of samples; without replicates, the two are equal.
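
The consistency rule described above can be checked before launching GSEA. The following is only a rough Python sketch, not part of GSEA itself: the function names `parse_cls` and `check_cls` are made up for illustration, and it assumes a categorical three-line .cls file and a tab-delimited expression header whose first two fields are NAME and DESCRIPTION.

```python
def parse_cls(text):
    """Parse a categorical .cls phenotype file (three non-empty lines)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, found {len(lines)}")
    # Line 1: <number of samples> <number of classes> 1
    n_samples, n_classes, _ = (int(x) for x in lines[0].split())
    # Line 2: class names, prefixed with '#'
    class_names = lines[1].lstrip("#").split()
    # Line 3: one class label per sample, in column order
    labels = lines[2].split()
    return n_samples, n_classes, class_names, labels


def check_cls(cls_text, header_line):
    """Return a list of mismatches between a .cls file and an expression header."""
    n_samples, n_classes, class_names, labels = parse_cls(cls_text)
    # Assumption: tab-delimited header; NAME and DESCRIPTION are not data columns.
    n_data_cols = len(header_line.rstrip("\n").split("\t")) - 2
    problems = []
    if len(labels) != n_samples:
        problems.append(
            f"line 1 declares {n_samples} samples but line 3 has {len(labels)} labels"
        )
    if len(class_names) != n_classes:
        problems.append(
            f"line 1 declares {n_classes} classes but line 2 names {len(class_names)}"
        )
    if len(set(labels)) != n_classes:
        problems.append(
            f"line 1 declares {n_classes} classes but line 3 uses {len(set(labels))}"
        )
    if n_data_cols != n_samples:
        problems.append(
            f".cls declares {n_samples} samples but the expression table "
            f"has {n_data_cols} data columns"
        )
    return problems


if __name__ == "__main__":
    cls_text = "12 4 1\n#UT NC A C\nUT UT UT NC NC NC A A A C C C\n"
    header = "NAME\tDESCRIPTION\tUT\tNC\tA\tC"
    for problem in check_cls(cls_text, header):
        print(problem)
```

Run on the files from this thread, the check reports a single mismatch: 12 declared samples against 4 data columns, which is the same disagreement the ArrayIndexOutOfBoundsException is complaining about.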


It is clear now, thanks. In fact I have 12 samples in total: 3 biological replicates in each of 4 conditions. My understanding from the GSEA manual was that it is better to supply averaged expression values, and I wasn't aware that the phenotype file has to 'talk to' the expression table. The bottom line, then, is that the first value in the phenotype file is the number of columns that contain expression data. Cheers mate
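
One way to keep the phenotype file in step with the expression table is to generate it from a list holding one class label per expression column. This is a minimal Python sketch under that assumption; `make_cls` is a hypothetical helper, not a GSEA function, and it writes the class-name line in the same `#UT NC A C` style used earlier in this thread.

```python
def make_cls(sample_labels):
    """Build categorical .cls text from one class label per expression column."""
    if not sample_labels:
        raise ValueError("need at least one sample label")
    if any(len(label.split()) != 1 for label in sample_labels):
        raise ValueError("labels must be non-empty and contain no whitespace")
    # Keep classes in order of first appearance in the column list.
    class_names = list(dict.fromkeys(sample_labels))
    first_line = f"{len(sample_labels)} {len(class_names)} 1"
    second_line = "#" + " ".join(class_names)
    third_line = " ".join(sample_labels)
    return "\n".join([first_line, second_line, third_line]) + "\n"


if __name__ == "__main__":
    # 12 replicate columns: 3 per condition, in column order.
    labels = ["UT"] * 3 + ["NC"] * 3 + ["A"] * 3 + ["C"] * 3
    print(make_cls(labels), end="")
```

Because the first number is derived from the label list, it always equals the number of sample columns, whether the table holds the 12 replicates or only the 4 averaged columns.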
