Question: Differential gene expression analysis of real-valued expression data of genes and individuals
Davide Chicco (Canada) wrote, 5 months ago:

Hi

I have a gene expression dataset of 102 patients and 9 healthy controls. I downloaded it from GEO, applied several preprocessing steps (normalization, batch correction based on date, etc.), and was finally able to generate a table containing:

  • the individuals on the rows

  • the genes on the columns

  • each entry ij containing a real value that indicates the expression of gene_j in individual_i

This first preprocessing phase took a lot of effort. Now I would like to perform a differential gene expression analysis, to see how gene expression differs between the patients and the healthy controls.

I checked some packages online (such as DESeq2), and I noticed that they all require input files containing raw counts. Unfortunately, I don't have raw counts.

I would like to perform a differential gene expression analysis by myself, by taking advantage of biostatistics R functions applied on my preprocessed tables.

How can I do it? Any suggestion?

Thanks!

h.mon (Brazil) wrote, 5 months ago:

Look at Linear Models for Microarray Data, or limma. The User Guide is particularly helpful:

https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf


Adding to this: what you have is array data, which gives you intensity values rather than counts, so a relative measure of gene expression rather than the absolute counts you would get from RNA-seq. limma is pretty much the standard here, and following its workflow should get you the intended results. Be sure to read the manual thoroughly, and also look at this end-to-end workflow for Affymetrix microarrays.
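A minimal sketch of how that workflow might look for a table shaped like the one in the question, offered as an illustration rather than a definitive analysis. The data below are simulated, and the names expr_table and group are hypothetical stand-ins for the real preprocessed table and sample labels. Note that limma expects genes on the rows and samples on the columns, so a table with individuals on the rows has to be transposed first.

```r
library(limma)

set.seed(42)
n_patients <- 102
n_controls <- 9
n_genes    <- 200   # hypothetical number of genes, for illustration only
n_samples  <- n_patients + n_controls

# Simulated stand-in for the preprocessed table:
# individuals on the rows, genes on the columns.
expr_table <- matrix(
  rnorm(n_samples * n_genes, mean = 8, sd = 1),
  nrow = n_samples,
  dimnames = list(paste0("ind", seq_len(n_samples)),
                  paste0("gene", seq_len(n_genes)))
)

# Group label per individual; "control" is the reference level.
group <- factor(
  c(rep("patient", n_patients), rep("control", n_controls)),
  levels = c("control", "patient")
)

# Spike a true difference into the first five genes so there is something to find.
expr_table[group == "patient", 1:5] <- expr_table[group == "patient", 1:5] + 2

# Design matrix with an intercept (control mean) and a patient-vs-control column.
design <- model.matrix(~ group)

# limma wants genes x samples, hence the transpose.
fit <- lmFit(t(expr_table), design)
fit <- eBayes(fit)

# Test the patient-vs-control coefficient for every gene.
res <- topTable(fit, coef = "grouppatient", number = Inf, sort.by = "P")
head(res)
```

With this design, logFC is the estimated patient-minus-control difference for each gene, and adj.P.Val is the multiple-testing-adjusted p-value for that difference. Covariates such as batch could be added to the model formula in the same way, if they have not already been removed.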

(Reply written 5 months ago by ATpoint)

Thank you guys for your replies. There's a lot of material online and I feel like I am drowning in it. I found this interesting question and answer here on BioStars.org, which I tried to apply to my case. I used lmFit(table) and eBayes(fit), as explained there, without a design matrix.

I was able to generate a table with the values of the fitted model for the patients, and a table for the healthy controls. This is the head of the topTable of the patients fit:

head(topTable(fit, n=Inf, sort="p", p.value=0.05))

         logFC AveExpr   t  P.Value adj.P.Val    B
EEF1A1P5  13.2    13.2 362 5.03e-26  2.05e-22 45.2
MIR6891   13.2    13.2 358 5.77e-26  2.05e-22 45.2
HLA.G     12.8    12.8 356 6.34e-26  2.05e-22 45.1
RN7SK     13.6    13.6 355 6.41e-26  2.05e-22 45.1
MALAT1    12.8    12.8 354 6.65e-26  2.05e-22 45.1
HLA.J     12.9    12.9 352 7.31e-26  2.05e-22 45.1

Some questions:

1) What is the meaning of the p-values associated with each gene that I found this way?

2) Was it a good/useful idea to split the patients and healthy controls into two different tables and perform the analysis separately? Or should I keep them together and insert this information into the design parameter? If the latter, how?

Thanks!
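On question 1, one reading that is consistent with the table above, where logFC equals AveExpr for every gene: when lmFit is called without a design, limma fits an intercept-only model, so each p-value tests whether a gene's mean expression differs from zero, not whether it differs between groups. The base-R sketch below illustrates the same idea for a single simulated gene (the vector expr is hypothetical). limma's moderated t-statistic is not identical to the ordinary one shown here, but the hypothesis being tested is the same.

```r
set.seed(1)

# One hypothetical gene measured in 20 samples, with log-scale intensity around 13.
expr <- rnorm(20, mean = 13, sd = 0.1)

# Intercept-only linear model: the single coefficient is the mean expression.
fit <- lm(expr ~ 1)
intercept <- unname(coef(fit))
t_lm <- summary(fit)$coefficients["(Intercept)", "t value"]

# The same test written as a one-sample t-test of "mean expression = 0".
t_one_sample <- unname(t.test(expr, mu = 0)$statistic)

print(c(intercept = intercept, mean = mean(expr)))
print(c(t_lm = t_lm, t_one_sample = t_one_sample))
```

Because log-scale intensities are always far from zero, such a test is significant for essentially every expressed gene, which would explain the very large t-statistics. This also bears on question 2: fitting patients and controls separately never compares the two groups, whereas keeping them in one matrix and encoding the group in the design does.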

(Reply written 5 months ago by Davide Chicco)