I have been using the DESeq VST (variance-stabilizing transformation) method on gene counts produced by htseq-count as follows:
cds <- newCountDataSet(countData = dat, conditions = factor(conditions))
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds, sharingMode = "gene-est-only", method = "pooled", fitType = "local")
vst <- getVarianceStabilizedData(cds)
But honestly, I do not understand what exactly the getVarianceStabilizedData() function does. Can someone explain in simple terms:
- Why is it necessary to normalize raw count data? Why can't we use the raw count data?
- How exactly is the raw count data normalized by the getVarianceStabilizedData() function?
- Should the conditions parameter in the newCountDataSet() function match the conditions between which you want to find differentially expressed genes? For example, I have both cases and controls as well as males and females. Should I include both pieces of information in the conditions parameter, or just cases and controls?
I know these questions can be searched for easily, and I did search. But I would like a simple explanation from someone who uses these methods regularly, to clear up my understanding.