Question: How to select genes before log2 ratio on a RNASeq gene expression matrix, based on signal median
5.0 years ago by
European Union
fbrundu290 wrote:

I want to transform a TCGA mRNA expression matrix (on a linear scale) to log2 ratios and then run a feature (gene) selection, keeping the 1000 most variable genes (those with the highest standard deviation across samples). The workflow is the following:

 1. Select "good" genes before the log2 ratio (genes with a median signal of at least t in p% of samples);
 2. On the selected genes, compute the log2 ratio by dividing each gene by its median signal and then log2-transforming the resulting matrix;
 3. Select the 1000 most variable genes across all samples.
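
The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the matrix is a plain dict of gene name to per-sample values, step 1 is read as "signal at least t in at least p% of samples", and a pseudocount (not part of the original workflow) is added so that zeros do not break the log2. The function name and toy data are hypothetical.

```python
import math
import statistics


def select_variable_genes(expr, t, p, n_top=1000, pseudocount=1.0):
    """Return the names of the n_top most variable genes.

    expr: dict mapping gene name -> list of linear-scale values (one per sample).
    t, p: keep a gene only if at least p% of its samples have signal >= t.
    pseudocount: an assumption, added so that zero values do not break log2.
    """
    sds = {}
    for gene, values in expr.items():
        # Step 1: filter on signal level before any transformation.
        pct_expressed = 100 * sum(v >= t for v in values) / len(values)
        if pct_expressed < p:
            continue
        # Step 2: divide by the gene's median signal, then log2-transform.
        med = statistics.median(values)
        log_ratios = [
            math.log2((v + pseudocount) / (med + pseudocount)) for v in values
        ]
        sds[gene] = statistics.stdev(log_ratios)
    # Step 3: rank genes by standard deviation across samples, highest first.
    ranked = sorted(sds, key=sds.get, reverse=True)
    return ranked[:n_top]


# Toy matrix: four genes, four samples (made-up numbers).
expr = {
    "A": [100, 200, 400, 800],
    "B": [0, 0, 1, 0],
    "C": [50, 50, 50, 50],
    "D": [10, 1000, 10, 1000],
}
top = select_variable_genes(expr, t=10, p=75, n_top=2)
print(top)  # ['D', 'A']
```

Gene B never reaches t and is dropped in step 1; gene C passes the filter but has zero spread, so it ranks last.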

How do I select t and p?

modified 5.0 years ago by Sean Davis26k • written 5.0 years ago by fbrundu290

Hello fbrundu!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 5.0 years ago by Devon Ryan94k

Hi Devon,

Yes, I am sorry that I annoyed you. But since they are different communities (as far as I know they are also run by different organizations) and they address slightly different topics, I did not know which was the correct place to post this question. I think my question is appropriate for both communities (even if they serve different types of users), and their user bases may overlap (I did not post anywhere else).

Sorry about that.


written 5.0 years ago by fbrundu290
5.0 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

There is not a general solution to select "t" and "p".  Such choices are largely arbitrary.  Furthermore, for an array platform, if one assumes that "t" has something to do with "expressed", the value for "t" will differ for each probe on the array.  

Since you are ultimately going to filter based on variance, I'd suggest starting with your median-centered, log-transformed data and simply choosing the top 1000 most variable genes.

written 5.0 years ago by Sean Davis26k

The data that the OP is referring to are RNA-seq, so there are no probes. Sequencing bias correction can be applied to them.

written 5.0 years ago by Bharat Iyengar270

Thanks Sean. Regarding your last suggestion, I thought that it could introduce problems, since genes with very low median signal can show a high variance when logratio transformed. What do you think?
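
To make the concern concrete, here is a small sketch with made-up numbers: two hypothetical genes get the same absolute noise of about one unit, one around a median of 2 and one around a median of 1000. After dividing by the median and taking log2, the low-signal gene ends up with a far larger standard deviation.

```python
import math
import statistics


def log_ratio_sd(values):
    """Standard deviation of log2(value / median) across samples."""
    med = statistics.median(values)
    return statistics.stdev(math.log2(v / med) for v in values)


# Hypothetical genes with the same absolute spread (about +/- 1 unit).
low_gene = [1, 2, 3, 2, 1, 3]
high_gene = [999, 1000, 1001, 1000, 999, 1001]

low_sd = log_ratio_sd(low_gene)
high_sd = log_ratio_sd(high_gene)
print(low_sd > high_sd)  # True
```

With these values the low-signal gene would rank far above the high-signal one in a variance-based selection, even though its variation is plausibly just noise.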


written 5.0 years ago by fbrundu290

Well, log transformation will bring down the variance. Imagine 4 samples [0.5, 2, 8, 32]. Without log transformation the (sample) variance is 213.5625, but when you log2-transform the data the variance drops to 6.67.
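
The numbers above can be checked with a few lines of Python. Note that they correspond to the sample variance (dividing by n - 1), which is what `statistics.variance` computes.

```python
import math
import statistics

samples = [0.5, 2, 8, 32]

# Sample variance (n - 1 in the denominator) on the linear scale.
linear_var = statistics.variance(samples)

# Same statistic after a log2 transformation: the values become -1, 1, 3, 5.
log_var = statistics.variance([math.log2(x) for x in samples])

print(linear_var)         # 213.5625
print(round(log_var, 2))  # 6.67
```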

In any case, if the expression is consistently low, then the variance will be low. You should be careful with log transformations, especially when doing differential expression studies. I would suggest that you do the log transformation after selecting on median and variance.

written 5.0 years ago by Bharat Iyengar270

I followed your first advice using custom thresholds.

Hoping this is useful, the code of the pipeline is available at

I filtered out the genes that were below the overall 5th percentile in more than 5% of the samples. I think this could be a reasonable threshold. If not, please tell me.
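
For readers who want to try the same kind of filter, a minimal sketch could look like this. It is not the pipeline code mentioned above, just an illustration: the function name, the dict layout and the toy values are assumptions, and "overall 5th percentile" is taken to mean the 5th percentile of all values in the matrix pooled together.

```python
import statistics


def filter_low_genes(expr, pct=5, max_frac=0.05):
    """Drop genes below the overall pct-th percentile in more than max_frac of samples.

    expr: dict mapping gene name -> list of values (one per sample).
    pct: integer percentile (1..99) computed over all values pooled together.
    """
    all_values = [v for values in expr.values() for v in values]
    # quantiles(..., n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    threshold = statistics.quantiles(all_values, n=100, method="inclusive")[pct - 1]
    kept = {}
    for gene, values in expr.items():
        frac_below = sum(v < threshold for v in values) / len(values)
        if frac_below <= max_frac:
            kept[gene] = values
    return kept


# Toy matrix: five genes, four samples (made-up numbers).
expr = {
    "G1": [1, 50, 60, 70],
    "G2": [40, 50, 60, 70],
    "G3": [45, 55, 65, 75],
    "G4": [48, 58, 68, 78],
    "G5": [42, 52, 62, 72],
}
kept = filter_low_genes(expr)
print(list(kept))  # ['G2', 'G3', 'G4', 'G5']
```

With only four samples, a single value below the threshold already exceeds 5% of samples, so G1 is removed; on a real cohort the fraction behaves less abruptly.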


written 5.0 years ago by fbrundu290