Question: How to set a threshold for coefficient of variation in RNA-seq?
5
gaelgarcia170 wrote:

Hi all,

I am interested in extracting the genes from a dataset which have a high dynamic range in terms of their expression.
This is specifically for using the Monocle package for single-cell RNA-seq analysis. The author suggests reducing the gene list to those genes that a) are detected in a sufficient amount of cells, and b) vary over a sufficiently large dynamic range.

My idea of a sufficiently large dynamic range would be those genes that are above a certain threshold in a coefficient of variation plot (standard deviation / mean ), i.e., those genes which vary above that which is expected based on their mean expression across all samples.

However, I don't know how to obtain this expectation in order to define which genes deviate from it. I'm sure this is a common task in RNA-seq... any ideas?

Thanks!

modified 5.3 years ago • written 5.3 years ago by gaelgarcia170

The first thing to do is to make sure you have done variance stabilization. As RNA Seq data tends to follow the Negative Beta Binomial model, their SD tends to increase with mean. So before you calculate that ratio, try to use DESeq2 or other tools to variance stabilize your data.

Thank you, Sam. My understanding is that this plot helps precisely to do that - describe the mean-dependent variance of gene expression (based on higher-expressing genes tending to vary less in their expression than lower-expressing genes). And then I would have to find the equation that describes this behavior for my data, and  find the genes that deviate sufficiently from what is expected. But I don't know how to do that - are there any tools you would recommend for this?

1

If you are looking for the variance stabilization, you can always use the following code assuming:

sampleInfo = file with your sample condition, where first column = sample name and second column = condition

inputTable = raw count table

```library(DESeq2)
colData = data.frame(row.names=row.names(sampleInfo), condition=sampleInfo[,1])
dds <- DESeqDataSetFromMatrix(countData = inputTable, colData = colData, design = ~ condition)
colData(dds)\$condition <- factor(colData(dds)\$condition,levels=unique(sampleInfo[,1]))
colData(dds)\$condition <- relevel(colData(dds)\$condition,  as.character(sampleInfo[1,1]))
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)```

Then the vsd should contain the variance stabilized count value.

However, if you are looking for a deviation from the expected, are you actually looking for something similar to differential gene expression? If so, you can just simply follow the DESeq2 tutorial