Question: Low-expressed gene filtering, quantile normalization, and log2 transformation: which one goes first?
ewre (United States) wrote, 2.8 years ago:

Hi everyone,

I have been working with expression data for about 4 years (both microarray and RNA-seq), but this question still confuses me whenever I preprocess data. 1) My opinion is that we should at least filter out low-expressed genes first. The reason is that the aim of quantile normalization and log2 transformation is to give the data a better-behaved distribution; if quantile/log2 goes first and low-expressed gene filtering follows, the filtering may distort that distribution.

2) For log2 transformation and quantile normalization, I really don't know which one goes first.

Thank you in advance for your time and valuable suggestions.

Modified 2.8 years ago by Farbod • written 2.8 years ago by ewre

If you remove low-expressed genes first (across the samples/cohort) and then log-transform (FPKM + 1), the results should be fine.
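A minimal sketch of that order (filter first, then log-transform) might look like the following. The function name `filter_then_log2`, the thresholds `min_fpkm` and `min_samples`, the choice of base 2, and the toy matrix are all assumptions made for illustration; nothing here is a recommended cutoff.

```python
import numpy as np

def filter_then_log2(fpkm, min_fpkm=1.0, min_samples=2, pseudocount=1.0):
    """Drop low-expressed genes, then log2-transform what remains.

    fpkm: 2-D array, rows = genes, columns = samples.
    min_fpkm / min_samples: illustrative thresholds (assumptions).
    """
    fpkm = np.asarray(fpkm, dtype=float)
    # Keep a gene if it reaches min_fpkm in at least min_samples samples.
    keep = (fpkm >= min_fpkm).sum(axis=1) >= min_samples
    # The pseudocount keeps log2 defined for zero values.
    logged = np.log2(fpkm[keep] + pseudocount)
    return logged, keep

# Toy matrix: 4 genes x 3 samples (made-up numbers).
toy = np.array([
    [0.0, 0.2, 0.1],    # low in every sample -> removed
    [5.0, 7.0, 3.0],
    [0.5, 15.0, 31.0],
    [0.0, 0.0, 3.0],    # above threshold in one sample only -> removed
])
logged, keep = filter_then_log2(toy)
print(keep)
print(logged.round(2))
```

The filter is applied on the original FPKM scale, so the threshold keeps its plain meaning ("at least 1 FPKM in at least 2 samples") regardless of what transformation follows.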

Reply written 2.8 years ago by Ron

So is this question about RNA-seq or microarray?

Reply written 2.8 years ago by WouterDeCoster

Hi WouterDeCoster, I want to keep it general, covering both RNA-seq and microarray.

Reply written 2.8 years ago by ewre

RNA-seq and microarray are both transcriptomics, but that's the end of the similarities. Microarray data are continuous intensities; RNA-seq data are discrete counts (sampled from a negative binomial distribution, i.e. an overdispersed Poisson distribution).

I'll leave microarray analysis for someone else, but for RNA-seq the most accepted approach is to use tools like DESeq2 and edgeR, which model the data assuming this negative binomial distribution. So you don't want to preprocess the data here: to work optimally, the software expects raw, unmanipulated counts.
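To make "overdispersed Poisson" concrete, here is a small simulation sketch. It draws plain Poisson counts and gamma-Poisson (negative binomial) counts with the same mean and compares their variances. The mean, dispersion, and sample size are arbitrary illustrative values, not estimates from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, dispersion, n = 100.0, 0.2, 100_000   # illustrative values (assumptions)

# Plain Poisson: the variance equals the mean.
poisson = rng.poisson(mean, size=n)

# Gamma-Poisson mixture = negative binomial.
# The per-draw rate varies around `mean`, which inflates the variance to
# mean + dispersion * mean**2.
rates = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion, size=n)
negbin = rng.poisson(rates)

print("Poisson  mean/var:", poisson.mean(), poisson.var())
print("NegBin   mean/var:", negbin.mean(), negbin.var())
print("Expected NegBin variance:", mean + dispersion * mean**2)
```

Both samples have roughly the same mean, but the negative binomial variance is far larger than the mean, which is the extra spread that a Poisson model alone cannot capture.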

Reply written 2.8 years ago by WouterDeCoster

Thank you. Exactly: for raw RNA-seq read counts, I usually use DESeq2 and edgeR for DEG analysis. But sometimes I only have RPKM/FPKM data, and that's where I run into trouble.

Reply written 2.8 years ago by ewre
Farbod (Toronto) wrote, 2.8 years ago:

Dear hanguangchun, Hi.

I think removing low-expressed genes first and then applying the log2 transform is the more usual order.
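For the quantile step that the question also asks about, the sketch below shows what a basic quantile normalization does to a genes-by-samples matrix: every sample's sorted values are replaced by the mean of the sorted values across samples. The function name `quantile_normalize` and the toy numbers are hypothetical, ties are not averaged in this simple version, and applying it to values that are already log2-transformed is an assumption made only for the example, not an order established in this thread.

```python
import numpy as np

def quantile_normalize(mat):
    """Basic quantile normalization; rows = genes, columns = samples.

    Ties within a column are broken by sort order, not averaged.
    """
    mat = np.asarray(mat, dtype=float)
    order = np.argsort(mat, axis=0)                      # per-sample sort order
    sorted_vals = np.take_along_axis(mat, order, axis=0)
    reference = sorted_vals.mean(axis=1)                 # mean value at each rank
    ranks = np.argsort(order, axis=0)                    # rank of each entry in its column
    return reference[ranks]

# Toy matrix of already log2-transformed values: 4 genes x 3 samples.
log_expr = np.array([
    [2.0, 4.0, 3.0],
    [5.0, 4.5, 6.0],
    [3.0, 1.0, 4.0],
    [4.0, 2.0, 8.0],
])
qn = quantile_normalize(log_expr)
print(qn.round(3))
```

After normalization every column contains the same set of values, and each gene keeps its within-sample rank. Since the reference is a mean across samples, the mean of log2 values is not the log2 of the mean, so the two possible orders give different numbers even though the ranks agree.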

Also, please have a look at "There are too many transcripts! What do I do?"

and the IsoPct < 1 section of this paper for excluding the spurious transcripts.

~ Best

Written 2.8 years ago by Farbod

Thank you very much for the information, Farbod.

Reply written 2.8 years ago by ewre