Filtering low variance genes for WGCNA
Entering edit mode
16 months ago
pennakiza ▴ 60

Hello everyone,

I have a question re filtering for low variance prior to WGCNA. I have got RNASeq data, pre-filtered for low counts and transformed with DESeq2 vst. I was wondering if you could help me select from the two methods below the one that is more correct for my data.

filter <- function(x)(IQR(x, na.rm=T)>0.25)
filtered_genes <- genefilter(df,filter)


data$variance = apply(data, 1, var)
data = data[data$variance >= quantile(data$variance, c(0.25)), ]
data$variance <- NULL

Thank you very much for your help!


WGCNA genefilter IQR variance filtering • 1.4k views
Entering edit mode
16 months ago

Hi Penny,

In my opinion, if you have already produced variance-stabilised expression levels via the standard DESeq2 workflow, then no additional filtering for variance should be performed. DESeq2 specifically tackles the issue of low and high variability and its relationship with mean expression; so, you can assume that the problem has already been managed via the DESeq2 normalisation and transformation (VST)..

According to the WGCNA developers:

Should I filter probesets or genes? Probesets or genes may be filtered by mean expression or variance (or their robust analogs such as median and median absolute deviation, MAD) since low-expressed or non-varying genes usually represent noise. Whether it is better to filter by mean expression or variance is a matter of debate; both have advantages and disadvantages, but more importantly, they tend to filter out similar sets of genes since mean and variance are usually related.



Entering edit mode

I may add that some filtering is probably still meaningful. You want to exclude the noise so the genes that are either completely non-expressed (=0) and with very low variation between samples as no variation would mean no correlated changes in expression, so the very metric that WGCNA is interested in. You could simply plot the row-wise variance and draw a visual cutoff to exclude genes with low information content. Example:

vsd <- vst(dds, blind=FALSE)
rv <- matrixStats::rowVars(as.matrix(assay(vsd)))
rv2 <- data.frame(Seq = seq(1:nrow(vsd)), rowVars = rv[order(rv, decreasing = TRUE)])
theme_set(theme_bw(base_size = 10))
ggplot(rv2, aes(x=Seq,y=rowVars)) + geom_line() + scale_y_log10() +
  ggtitle("vst-transformed counts ordered by rowVar")

enter image description here

Based on this you could keep the top 10.000 genes as this is somewhat the inflextion point of the curve. Mind that the y-axis is already log10-scaled, on arithmetric scale the drop is even sharper. Keeping top-10k would exclude the majority of genes which most likely do not anything to the analysis as they probably do not any meaningful information.


Login before adding your answer.

Traffic: 1420 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6