Hi all, I'm performing some differential analysis with edgeR and I need to take into account GC content in my regions. From edgeR manual I see I should set GC percentage as "offset" in linear model. Unfortunately there is very little documentation about this, so I need some hint.
Offsets can be passed both to
glmFit functions, but it is not clear how they are handled in the two steps. Setting offset in estimateDisp, i.e.
y = DGEList(…) y = calcNormFactors(y) y$offset = gc_content #this is a vector with %GC for every region in y y = estimateDisp(y, design)
increases the dispersion values (common, tagwise and trended) in the DGEList object, with negative effect on final pvalues and FDR. Setting it at GLM, i.e.
y = DGEList(…) y = calcNormFactors(y) y = estimateDisp(y, design) fit = glmFit.default(y, design, offset=gc_content, dispersion=y$tagwise.dispersion)
does exactly the opposite: all p-values and FDR become "incredibly" significant, also average logCPM for each region is increased 10-fold (which, I suspect, is the reason why I have lower p-values).
What would be the proper way to include GC information in edgeR?