Question: Differential gene expression analysis of real-valued expression data of genes and individuals
20 months ago, Davide Chicco wrote:


I have a dataset of gene expression values for 102 patients and 9 healthy controls. I downloaded this dataset from GEO, applied several preprocessing steps (normalization, batch correction based on date, etc.), and was finally able to generate a table containing:

  • the individuals on the rows

  • the genes on the columns

  • each entry ij containing a real value that indicates the expression of gene j in individual i
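The layout described above can be sketched in R as follows. This is a minimal illustration on simulated values, not the actual GEO data: the object names (`expr_table`, `group`, `expr_mat`), the number of genes, and the assumption that patients are listed first are all hypothetical. One point worth noting is that limma and most Bioconductor tools expect the transpose of this table, with genes on the rows and samples on the columns.

```r
set.seed(1)

n_patients <- 102
n_controls <- 9
n_genes    <- 50   # hypothetical; the real table has far more genes

# Simulated log-scale expression values: individuals on rows, genes on columns
expr_table <- matrix(
  rnorm((n_patients + n_controls) * n_genes, mean = 8, sd = 1),
  nrow = n_patients + n_controls,
  ncol = n_genes,
  dimnames = list(
    paste0("sample", seq_len(n_patients + n_controls)),
    paste0("gene", seq_len(n_genes))
  )
)

# Group label for each row (assumes patients are listed first)
group <- factor(
  rep(c("patient", "control"), times = c(n_patients, n_controls)),
  levels = c("control", "patient")
)

# Transpose to genes x samples, the orientation limma expects
expr_mat <- t(expr_table)
dim(expr_mat)
```

Setting "control" as the first factor level makes it the reference group, so later comparisons read as patient minus control.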

This first preprocessing phase took a lot of effort. Now I would like to perform a differential gene expression analysis, to see how gene expression differs between the patients and the healthy controls.

I checked some packages online (such as DESeq2), and I noticed they all have specific requirements for their input files, which must contain raw counts. Unfortunately, I don't have raw counts.

I would like to perform a differential gene expression analysis myself, by applying biostatistics R functions to my preprocessed tables.

How can I do it? Any suggestions?


20 months ago, h.mon wrote:

Look at limma (Linear Models for Microarray Data). The User Guide is particularly helpful.


Adding to this: what you have is array data, which gives you intensity values rather than counts, so it is a relative measure of gene expression rather than the absolute counts you get from RNA-seq. limma is pretty much the standard for this kind of data, and following its workflow should get you the intended results. Be sure to read the manual thoroughly, and also look at this end-to-end workflow for Affymetrix microarrays.
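As a rough sketch of what following the limma workflow could look like for a two-group comparison: the snippet below assumes the limma package is installed from Bioconductor and that the values are already normalized and on a log2 scale. It uses simulated data with hypothetical names (`expr_mat`, `group`) in place of the real table, so it is an outline under those assumptions rather than a substitute for the User Guide.

```r
library(limma)

set.seed(42)

# Simulated log2 expression matrix: genes on rows, samples on columns
n_genes <- 200
group <- factor(rep(c("control", "patient"), times = c(9, 102)),
                levels = c("control", "patient"))
expr_mat <- matrix(rnorm(n_genes * length(group), mean = 8, sd = 1),
                   nrow = n_genes,
                   dimnames = list(paste0("gene", seq_len(n_genes)),
                                   paste0("sample", seq_along(group))))

# Spike a true difference into the first 10 genes for the patients
expr_mat[1:10, group == "patient"] <- expr_mat[1:10, group == "patient"] + 2

# Design matrix: intercept = control mean, second column = patient - control
design <- model.matrix(~ group)

fit <- lmFit(expr_mat, design)   # gene-wise linear models
fit <- eBayes(fit)               # moderated t-statistics

# Genes ranked by evidence for a patient vs. control difference
results <- topTable(fit, coef = "grouppatient", number = Inf, sort.by = "P")
head(results)
```

Here `coef = "grouppatient"` selects the patient-minus-control coefficient, so `logFC` in the output is the estimated log2 fold change between the two groups.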

(comment by ATpoint, 20 months ago)

Thank you guys for your replies. There's a lot of material online and I feel like I am drowning in it. I found this interesting question and answer, which I tried to apply to my case. I used lmFit(table) and eBayes(fit), as explained there, without specifying a design matrix.

I was able to generate a table with the values of the fitted model for the patients, and another table for the healthy controls. This is the head of the topTable output for the patients' fit:

head(topTable(fit, n=Inf, sort="p", p.value=0.05))

         logFC AveExpr   t  P.Value adj.P.Val    B
EEF1A1P5  13.2    13.2 362 5.03e-26  2.05e-22 45.2
MIR6891   13.2    13.2 358 5.77e-26  2.05e-22 45.2
HLA.G     12.8    12.8 356 6.34e-26  2.05e-22 45.1
RN7SK     13.6    13.6 355 6.41e-26  2.05e-22 45.1
MALAT1    12.8    12.8 354 6.65e-26  2.05e-22 45.1
HLA.J     12.9    12.9 352 7.31e-26  2.05e-22 45.1

Some questions:

1) What is the meaning of the p-values associated with each gene that I obtained this way?

2) Was it a good or useful idea to split the patients and healthy controls into two different tables and perform the analysis separately? Or should I keep them together and pass this information through the design parameter? If the latter, how?
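One way to look at both questions, as a sketch in base R on simulated values (the vectors `y` and `group` are hypothetical stand-ins for a single gene and the sample labels): when lmFit is called without a design, it defaults to an intercept-only model that treats all arrays as replicates, so the test asks whether the mean expression of a gene differs from zero. For log-scale intensities that holds for almost every gene, which would be consistent with the tiny p-values and with logFC being equal to AveExpr in the table above. Keeping both groups in one table and adding a group term tests the patient versus control difference instead. The moderated statistics limma reports will differ numerically from plain `lm`, but the coefficients have the same interpretation.

```r
set.seed(7)

# One simulated gene measured in 9 controls and 102 patients (log2 scale)
group <- factor(rep(c("control", "patient"), times = c(9, 102)),
                levels = c("control", "patient"))
y <- rnorm(length(group), mean = 13, sd = 0.4)   # no true group difference

# Intercept-only model: what a fit without a design amounts to.
# The coefficient is the mean expression; the test is mean != 0.
fit_intercept <- lm(y ~ 1)
coef_intercept <- summary(fit_intercept)$coefficients
coef_intercept["(Intercept)", c("Estimate", "Pr(>|t|)")]

# Two-group model: the second coefficient is patient minus control,
# and its p-value tests whether the groups differ.
fit_group <- lm(y ~ group)
coef_group <- summary(fit_group)$coefficients
coef_group["grouppatient", c("Estimate", "Pr(>|t|)")]
```

In limma terms, the second model corresponds to passing `model.matrix(~ group)` as the design and reading the results for the group coefficient, with all samples kept in a single matrix.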


(reply by Davide Chicco, 20 months ago, modified 19 months ago)