Question: Filtering low variance genes for WGCNA
gravatar for pennakiza
9 weeks ago by
pennakiza60 wrote:

Hello everyone,

I have a question re filtering for low variance prior to WGCNA. I have got RNASeq data, pre-filtered for low counts and transformed with DESeq2 vst. I was wondering if you could help me select from the two methods below the one that is more correct for my data.

filter <- function(x)(IQR(x, na.rm=T)>0.25)
filtered_genes <- genefilter(df,filter)


data$variance = apply(data, 1, var)
data = data[data$variance >= quantile(data$variance, c(0.25)), ]
data$variance <- NULL

Thank you very much for your help!


ADD COMMENTlink modified 8 weeks ago by Kevin Blighe71k • written 9 weeks ago by pennakiza60
gravatar for Kevin Blighe
8 weeks ago by
Kevin Blighe71k
Republic of Ireland
Kevin Blighe71k wrote:

Hi Penny,

In my opinion, if you have already produced variance-stabilised expression levels via the standard DESeq2 workflow, then no additional filtering for variance should be performed. DESeq2 specifically tackles the issue of low and high variability and its relationship with mean expression; so, you can assume that the problem has already been managed via the DESeq2 normalisation and transformation (VST)..

According to the WGCNA developers:

Should I filter probesets or genes? Probesets or genes may be filtered by mean expression or variance (or their robust analogs such as median and median absolute deviation, MAD) since low-expressed or non-varying genes usually represent noise. Whether it is better to filter by mean expression or variance is a matter of debate; both have advantages and disadvantages, but more importantly, they tend to filter out similar sets of genes since mean and variance are usually related.



ADD COMMENTlink written 8 weeks ago by Kevin Blighe71k

I may add that some filtering is probably still meaningful. You want to exclude the noise so the genes that are either completely non-expressed (=0) and with very low variation between samples as no variation would mean no correlated changes in expression, so the very metric that WGCNA is interested in. You could simply plot the row-wise variance and draw a visual cutoff to exclude genes with low information content. Example:

vsd <- vst(dds, blind=FALSE)
rv <- matrixStats::rowVars(as.matrix(assay(vsd)))
rv2 <- data.frame(Seq = seq(1:nrow(vsd)), rowVars = rv[order(rv, decreasing = TRUE)])
theme_set(theme_bw(base_size = 10))
ggplot(rv2, aes(x=Seq,y=rowVars)) + geom_line() + scale_y_log10() +
  ggtitle("vst-transformed counts ordered by rowVar")

enter image description here

Based on this you could keep the top 10.000 genes as this is somewhat the inflextion point of the curve. Mind that the y-axis is already log10-scaled, on arithmetric scale the drop is even sharper. Keeping top-10k would exclude the majority of genes which most likely do not anything to the analysis as they probably do not any meaningful information.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by ATpoint46k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2248 users visited in the last hour