Question: What is the correct order of preprocessing steps in DNA microarray gene expression analysis?
J. Smith wrote:

Hi friends,

I have downloaded DNA microarray data from NCBI. The data contain both control and affected samples for all genes. I want to perform downstream analyses such as clustering and classification. I know that some preprocessing steps, such as normalization, log2 transformation, and selection of differentially expressed genes, are necessary before clustering or classification.

But I am unsure about the exact order of these preprocessing steps, although I know that normalization is performed before log2 transformation. Please let me know the following:

1) Do normalization and then log2 transformation need to be done before selecting differentially expressed genes, and should differentially expressed genes be selected using the normalized, log2-transformed data?

2) For RNA-Seq data, I learned that differential expression analysis is done on un-normalized, un-logged count data, because the statistical model is most powerful when applied to un-normalized counts. Can we then also select differentially expressed genes from microarray data without performing normalization and log2 transformation? Please note that I will use SAM or limma to select differentially expressed genes from the microarray data.

3) Are any other preprocessing or quality control steps necessary before clustering? If so, please mention their exact order.

Thanks in advance.

written 5 months ago by J. Smith • modified 5 months ago by Santosh Anand

See this end-to-end workflow.

written 5 months ago by ATpoint

Thanks a lot. I now understand that there are many more preprocessing steps to carry out before limma can be applied to find differentially expressed genes, and that limma should be applied only to the final preprocessed data. But can you please tell me why this differs from RNA-Seq? That is, why should limma be applied to the final preprocessed data for microarrays, whereas for RNA-Seq, DESeq2 should be applied to raw count data without normalization or log2 transformation?

written 5 months ago by J. Smith

Some packages, such as gcrma, take care of normalization and log transformation. You can refer to it here: http://www.bioconductor.org/packages/release/bioc/html/gcrma.html

You can then use these values to perform DE analysis with limma.
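
To make the order of steps concrete, here is a minimal Python sketch of quantile normalization followed by log2 transformation. It is only an illustration under simplifying assumptions: it is not the gcrma algorithm (it omits background correction and probe summarization), ties are broken by position, and the intensity values and function names are made up for the example.

```python
import math

def quantile_normalize(columns):
    """Quantile-normalize samples so they share one intensity distribution.

    `columns` is a list of samples; each sample is a list of intensities,
    one per probe, in the same probe order. Ties are broken by position,
    which is a simplification.
    """
    n_probes = len(columns[0])
    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(columns)
        for r in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Replace each value with the mean intensity at its rank.
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

def log2_transform(columns):
    """Apply log2 to every (positive) intensity."""
    return [[math.log2(v) for v in col] for col in columns]

# Hypothetical raw intensities: two samples, four probes each.
raw = [
    [120.0, 800.0, 60.0, 3000.0],
    [200.0, 1500.0, 90.0, 5200.0],
]

# Step 1: normalize across samples. Step 2: log2-transform.
expr = log2_transform(quantile_normalize(raw))
```

After these two steps every sample has the same distribution of values, and `expr` is the kind of normalized log2 matrix that would then go into DE analysis.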

written 5 months ago by sangram_keshari220

Thanks a lot for your response and for the package reference.

written 5 months ago by J. Smith

Santosh Anand wrote:

Your questions are very general and broad; it would be better to start with a tutorial on microarray analysis and then formulate specific questions.

https://bioconductor.org/packages/devel/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html#1_introduction

To briefly answer your Qs:

  1. Normalization is necessary to compare across samples. It is essential before you do any downstream analysis (clustering, DE, etc.). Expression data vary widely and are skewed. Log2 is a useful transform that makes the data behave more "normally" and also reduces variability:

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/

    https://genomicsclass.github.io/book/pages/robust_summaries.html

    An additional advantage is that you can interpret fold changes as powers of 2 (a difference of 1 on the log2 scale is a 2-fold change). For clustering, PCA (or any other dimensionality reduction analysis), it is imperative that you use normalized values, which are often returned as log2 values.

  2. RNA-Seq data are counts and are therefore discrete, whereas microarray data are intensity values and are continuous. They follow different statistical models, so their DE analysis is a bit different. But in either case normalization is fundamental, because it makes the samples comparable with each other (tools such as DESeq2 take raw counts as input but still normalize internally, using estimated size factors). See this tutorial for the basic idea of DE using limma:

    https://www.bioconductor.org/help/course-materials/2005/BioC2005/labs/lab01/estrogen/

  3. Normalization is essential. You may also select only the highly variable genes for clustering/PCA.
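
As a rough illustration of points 1 and 3, the Python sketch below computes log2 fold changes and picks the most variable genes from an already normalized, log2-scale matrix. The gene names, values, and function names are hypothetical, and this is not a substitute for limma or SAM, which also model variance and test for significance.

```python
import statistics

def log2_fold_change(control, affected):
    """Difference of group means on the log2 scale (affected minus control)."""
    return statistics.mean(affected) - statistics.mean(control)

def top_variable_genes(expr, n):
    """Return the n gene names with the largest sample variance."""
    ranked = sorted(
        expr,
        key=lambda gene: statistics.variance(expr[gene]),
        reverse=True,
    )
    return ranked[:n]

# Hypothetical normalized log2 expression values:
# first three samples are controls, last three are affected.
expr = {
    "geneA": [8.0, 8.1, 7.9, 10.0, 10.1, 9.9],
    "geneB": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9],
    "geneC": [9.0, 9.2, 8.8, 8.0, 8.2, 7.8],
}

# A log2 fold change of +2 means 2**2 = 4-fold up; -1 means 2-fold down.
lfc = {gene: log2_fold_change(v[:3], v[3:]) for gene, v in expr.items()}

# Keep only the most variable genes as input for clustering or PCA.
selected = top_variable_genes(expr, 2)
```

Here geneA comes out about 4-fold up and geneC about 2-fold down, while geneB barely varies and is dropped by the variance filter.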

written 5 months ago by Santosh Anand • modified 5 months ago

Thanks a lot for your detailed response and the reference papers.

written 5 months ago by J. Smith

Thank you. I updated the post with another relevant reference and corrected the formatting.

written 5 months ago by Santosh Anand