Question

What is the accurate order of preprocessing steps in DNA Microarray gene expression analysis?

2

4.9 years ago

J. Smith ▴ 80

Hi friends,

I have downloaded DNA microarray data from NCBI. The data contain both control and affected samples for all genes. I want to perform downstream analyses such as clustering and classification. I know that some preprocessing steps, such as normalization, log2 transformation, and selection of differentially expressed genes, are necessary before clustering or classification.

But I am unsure about the exact order of these preprocessing steps, although I know that normalization is performed before log2 transformation. Please let me know the following:

1) Do normalization and then log2 transformation need to be done before selecting differentially expressed genes, so that the selection is performed on the normalized, log2-transformed data?

2) For RNA-seq data, I learned that differential expression analysis is done on un-normalized, un-logged count data, because the statistical model is most powerful when applied to raw counts. Can we then also select differentially expressed genes from microarray data without normalization and log2 transformation? Please note that I will use SAM or limma to select differentially expressed genes from the microarray data.

3) Are there any other preprocessing or quality control steps necessary before clustering? If so, please mention their exact order.

Thanks in advance.

microarray limma SAM preprocessing • 11k views

ADD COMMENT • link updated 4.9 years ago by Santosh Anand 5.7k • written 4.9 years ago by J. Smith ▴ 80

1

See this end-to-end workflow.

ADD REPLY • link 4.9 years ago by ATpoint 82k

0

Thanks a lot. Now I understand that there are many more preprocessing steps to carry out before I can apply limma to find differentially expressed genes, and that limma should be applied only to the final preprocessed data. But can you please tell me why this differs from RNA-seq? That is, why should limma be applied to the final preprocessed data for microarrays, whereas for RNA-seq, DESeq2 should be applied to raw count data without normalization and log2 transformation?

ADD REPLY • link 4.9 years ago by J. Smith ▴ 80

1

Some packages, such as gcrma, take care of normalization and log transformation. You can refer to it here: http://www.bioconductor.org/packages/release/bioc/html/gcrma.html

You can then use these values to perform DE analysis with limma.

ADD REPLY • link 4.9 years ago by sangram_keshari ▴ 260

0

Thanks a lot for your response and for the package you referred to.

ADD REPLY • link 4.9 years ago by J. Smith ▴ 80

score 2 · Answer 1 · 2019-06-06

Your Qs are very general and broad. It would be better to start with a microarray analysis tutorial and then formulate specific Qs.

https://bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html#1_introduction

To briefly answer your Qs:

Normalization is necessary to compare across samples. It is essential before you do any downstream analysis (clustering, DE, etc.). Expression data vary widely and are skewed. Log2 is a useful transform: it makes the data behave more "normally" and makes the variance depend less on the intensity level:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/

https://genomicsclass.github.io/book/pages/robust_summaries.html
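To make the two steps concrete, here is a minimal sketch in plain Python, not the code of any Bioconductor package. It assumes a small genes x samples matrix of made-up intensities, uses quantile normalization as one common choice of normalization, and applies log2 afterwards, following the order stated in the question. The function names and the `offset` parameter are hypothetical.

```python
import math

def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values.

    matrix is a list of rows (genes), each row holding one value per sample.
    Ties are broken by gene order, which is a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / n_samples for k in range(n_genes)
    ]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices of this sample, ordered from lowest to highest intensity.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene in enumerate(order):
            result[gene][j] = reference[rank]
    return result

def log2_transform(matrix, offset=0.0):
    """Return log2(x + offset); a positive offset guards against log2(0)."""
    return [[math.log2(x + offset) for x in row] for row in matrix]

# Made-up raw intensities: 4 genes (rows) x 3 samples (columns).
raw = [
    [100.0, 220.0, 90.0],
    [1500.0, 3100.0, 1400.0],
    [40.0, 70.0, 35.0],
    [800.0, 1500.0, 900.0],
]
normalized = quantile_normalize(raw)
logged = log2_transform(normalized)
```

After this, every sample has the same set of values (only their assignment to genes differs), which is what makes the samples comparable. In practice you would let a package such as the ones discussed in this thread do this for you.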

An additional advantage is that you can interpret fold changes as powers of 2: a difference of 1 on the log2 scale is a 2-fold change. For clustering, PCA (or any other dimensionality reduction analysis), it is imperative that you use normalized values, which are often returned as log2 values.
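As a small illustration of the fold change point, the sketch below computes a log2 fold change as a difference of group means and converts it back to the linear scale. The expression values and function names are made up for illustration. Note that a difference of means on the log2 scale corresponds to a ratio of geometric means on the raw scale.

```python
def log2_fold_change(treated_log2, control_log2):
    """Difference of group means of log2 expression values."""
    mean_treated = sum(treated_log2) / len(treated_log2)
    mean_control = sum(control_log2) / len(control_log2)
    return mean_treated - mean_control

def to_linear_fold(log2_fc):
    """Convert a log2 fold change back to a ratio on the raw scale."""
    return 2.0 ** log2_fc

# Made-up log2 expression values for one gene.
control = [8.0, 8.2, 7.8]
treated = [10.0, 10.1, 9.9]

lfc = log2_fold_change(treated, control)  # about 2
ratio = to_linear_fold(lfc)               # about 4-fold up in treated
```

A negative value works the same way: a log2 fold change of -1 means the gene is expressed at half the control level.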
RNA-seq data are counts and are discrete, whereas microarray data are intensity values and are continuous. They follow different statistical models, so their DE analysis is a bit different. But in either case, normalization is fundamental, as it allows the samples to be compared with each other. DESeq2 takes raw counts as input but estimates per-sample size factors internally, so the normalization happens inside the model rather than beforehand. See this tutorial for the basic idea of DE using limma:

https://www.bioconductor.org/help/course-materials/2005/BioC2005/labs/lab01/estrogen/
Normalization is essential. You may also select only the highly variable genes for clustering/PCA.
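One simple way to pick highly variable genes is to rank them by their variance across samples on the normalized log2 scale and keep the top ones. The sketch below assumes exactly that; the gene names, values, and the `top_variable_genes` helper are hypothetical, and other measures of variability (for example the median absolute deviation) would be equally reasonable.

```python
from statistics import variance

def top_variable_genes(expr, n_top):
    """Return the n_top gene names with the largest sample variance.

    expr maps a gene name to its normalized log2 values across samples.
    """
    variances = {gene: variance(values) for gene, values in expr.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_top]

# Made-up normalized log2 expression values: 4 genes x 4 samples.
expr = {
    "GENE_A": [8.0, 8.1, 7.9, 8.0],
    "GENE_B": [5.0, 9.0, 5.2, 9.1],
    "GENE_C": [10.0, 10.5, 9.5, 10.0],
    "GENE_D": [6.0, 6.0, 6.1, 6.0],
}

selected = top_variable_genes(expr, n_top=2)
```

Here GENE_B and GENE_C are kept, and the nearly constant genes, which carry little information for clustering, are dropped.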