Dear BioStar Users,
I am using minfi to analyze the results of an EPIC array. To check differential methylation I decided to look for differentially methylated regions (DMRs) using bumphunter as recommended.
bumphunter function returns candidate DMRs with several columns of information including value and area (clarification in this biostars post).
For 2 groups, value column is the average of the coefficients of the linear model calculated for each site within the DMR while area column is the sum of all coefficients. This can be easily checked by calculating the difference, between the 2 groups, of the mean value (M or beta) for each methylation site of the DMR and then averaging (value column) or summing (area column) these differences.
Thus, a positive value indicates more methylation in the non-control group, and the opposite for a negative value. Until here, everything is clear and makes sense.
My problem is that I have 3 groups, say control, drug A and drug B for example. But, bumphunter still returns a unique value and area. How do you interpret the value and area of each candidate DMR? What do they mean in terms methylation levels in each group? How would you filter them to get biological information?
To answer these questions I went to check the function code but got more confused.
bumphunter function can be accessed by typing getMethod("bumphunter", "matrix") and you get:
function (object, ...)
{
.local <- function (object, design, chr = NULL, pos, cluster = NULL,
coef = 2, cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99,
maxGap = 500, nullMethod = c("permutation", "bootstrap"),
smooth = FALSE, smoothFunction = locfitByCluster, useWeights = FALSE,
B = ncol(permutations), permutations = NULL, verbose = TRUE,
...)
{
nullMethod <- match.arg(nullMethod)
if (missing(design))
stop("design must be specified")
if (missing(pos))
stop("If object is a matrix, pos must be specified")
bumphunterEngine(object, design = design, chr = chr,
pos, cluster = cluster, coef = coef, cutoff = cutoff,
pickCutoff = pickCutoff, pickCutoffQ = pickCutoffQ,
maxGap = maxGap, nullMethod = nullMethod, smooth = smooth,
smoothFunction = smoothFunction, useWeights = useWeights,
B = B, permutations = NULL, verbose = verbose, ...)
}
.local(object, ...)
}
So we find that it is, basically, a S4 method to call the bumphunterEngine function that does most of the job.
So, typing bumphunterEngine you get:
function (mat, design, chr = NULL, pos, cluster = NULL, coef = 2,
cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500,
nullMethod = c("permutation", "bootstrap"), smooth = FALSE,
smoothFunction = locfitByCluster, useWeights = FALSE, B = ncol(permutations),
permutations = NULL, verbose = TRUE, ...)
{
...
if (useWeights & smooth) {
tmp <- .getEstimate(mat = mat, design = design, coef = coef,
full = TRUE)
rawBeta <- tmp$coef
weights <- 1/tmp$sigma
rm(tmp)
}
else {
rawBeta <- .getEstimate(mat = mat, design = design, coef = coef,
full = FALSE)
weights <- NULL
}
...
else {
beta <- rawBeta
Index <- seq(along = beta)
}
...
comp <- computation.tots2(tab = tab, A = A)
rate2 <- comp$rate2
pvalues2 <- comp$pvalues2
tab$p.value <- pvalues1
tab$fwer <- rate1
tab$p.valueArea <- pvalues2
tab$fwerArea <- rate2
tab <- tab[order(tab$fwer, -tab$area), ]
algorithm <- list(version = packageDescription("bumphunter")$Version,
coef = coef, cutoff = cutoff, pickCutoff = pickCutoff,
pickCutoffQ = pickCutoffQ, smooth = smooth, maxGap = maxGap,
nullMethod = nullMethod, B = B, permutations = permutations,
useWeights = useWeights, smoothFunction = deparse(substitute(smoothFunction)))
ret <- list(table = tab, coef = rawBeta, fitted = beta, pvaluesMarginal = pvs,
null = list(value = V, length = L), algorithm = algorithm)
class(ret) <- "bumps"
return(ret)
}
It is a long function, but the relevant parts for this problem IMO are coef = 2, in the parameters of the function, and .getEstimate(mat = mat, design = design, coef = coef, ...) which calculates the coefficients.
Regarding coef = 2, we can see that the original bumphunter function already has it defined as defaults. Also, knowing that bumhunter was originally defined to compare 2 groups I went to check the minfi wrapper to see if they implemented a change.
setMethod(
"bumphunter",
signature(object = "GenomicRatioSet"),
function(object, design, cluster = NULL, coef = 2, cutoff = NULL,
pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500,
nullMethod = c("permutation", "bootstrap"), smooth = FALSE,
smoothFunction = locfitByCluster, useWeights = FALSE,
B = ncol(permutations), permutations = NULL, verbose = TRUE,
type = c("Beta","M"), ...) {
.isMatrixBackedOrStop(object, "bumphunter")
type <- match.arg(type)
bumphunterEngine(
mat = getMethSignal(object, type),
design = design,
chr = as.factor(seqnames(object)),
pos = start(object),
cluster = cluster,
coef = coef,
cutoff = cutoff,
pickCutoff = pickCutoff,
pickCutoffQ = pickCutoffQ,
maxGap = maxGap,
nullMethod = nullMethod,
smooth = smooth,
smoothFunction = smoothFunction,
useWeights = useWeights,
B = B,
permutations = permutations,
verbose = verbose,
...)
}
)
We see that the minfi wrapper basically fits the right matrix of data (mat = getMethSignal(object, type)) to dhe bumphunterEngine function keeping the coef = 2 option.
At this point, I went to check the other function .getEstimate(mat = mat, design = design, coef = coef, ...) which calculates the coefficients, if I understood properly, based on the data matrix (mat = mat), the experimental design (design = design) and the coefficient (coef = coef, which gets set to 2). The code can be found here.
.getEstimate <- function(mat, design, coef, B=NULL, permutations=NULL,
full=FALSE) {
v <- design[,coef]
A <- design[,-coef, drop=FALSE]
qa <- qr(A)
S <- diag(nrow(A)) - tcrossprod(qr.Q(qa))
vv <- if (is.null(B)) matrix(v, ncol=1) else{
if(is.null(permutations)){
replicate(B, sample(v))
} else{
apply(permutations,2,function(i) v[i])
}
}
sv <- S %*% vv
vsv <- diag(crossprod(vv,sv))
b <- (mat %*% crossprod(S, vv)) / vsv
if(!is.matrix(b))
b <- matrix(b, ncol = 1)
if (full) {
sy <- mat %*% S
df.residual <- ncol(mat) - qa$rank - 1
if (is.null(B)) {
sigma <- matrix(sqrt(rowSums((sy - tcrossprod(b, sv))^2) / df.residual), ncol=1)
} else {
sigma <- b
tmp <- sy
for (j in 1:B) {
tmp <- tcrossprod(b[,j], sv[,j])
sigma[,j] <- rowSums((sy-tmp)^2)
}
sigma <- sqrt(sigma/df.residual)
}
out <- list(coef=b,
sigma=sigma,
stdev.unscaled=sqrt(1/vsv),
df.residual=df.residual)
if (is.null(B)) out$stdev <- as.numeric(out$stdev)
} else {
out <- b
}
return(out)
}
and my design would be something like:
des.groups <- c(rep("control", 2), rep("drug.A", 2), rep('drug.B', 2))
design <- model.matrix(~ des.groups)
design
(Intercept) des.groupsdrug.A des.groupsdrug.B
1 1 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 1 0 1
6 1 0 1
I am worried that, just at the beggining of the .getEstimate function, it subsets the design based on the coef argument
v <- design[,coef]
getting
> v
1 2 3 4 5 6
0 0 1 1 0 0
However this is as far as I can go with my linear algebra knowledge. So I am wondering if the value and area returned by bumphunter reflect the mehylation changes of "just" 2 groups (in my case control (Intercept) and des.groupsdrug.A).
What do you think? If this is the case, how would you address the comparison among 3 groups? with 3 pairwise comparisons?
If you arrived here, THANKS for reading the long post and please share your thoughts.