Question

bumphunter for 3 (or more) groups

0

Entering edit mode

6.6 years ago

alfonsosaera ▴ 50

Dear BioStar Users,

I am using minfi to analyze the results of an EPIC array. To check differential methylation I decided to look for differentially methylated regions (DMRs) using bumphunter as recommended.

bumphunter function returns candidate DMRs with several columns of information including value and area (clarification in this biostars post).

For 2 groups, value column is the average of the coefficients of the linear model calculated for each site within the DMR while area column is the sum of all coefficients. This can be easily checked by calculating the difference, between the 2 groups, of the mean value (M or beta) for each methylation site of the DMR and then averaging (value column) or summing (area column) these differences.

Thus, a positive value indicates more methylation in the non-control group, and the opposite for a negative value. Until here, everything is clear and makes sense.

My problem is that I have 3 groups, say control, drug A and drug B for example. But, bumphunter still returns a unique value and area. How do you interpret the value and area of each candidate DMR? What do they mean in terms methylation levels in each group? How would you filter them to get biological information?

To answer these questions I went to check the function code but got more confused.

bumphunter function can be accessed by typing getMethod("bumphunter", "matrix") and you get:

function (object, ...) 
{
  .local <- function (object, design, chr = NULL, pos, cluster = NULL, 
                      coef = 2, cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99, 
                      maxGap = 500, nullMethod = c("permutation", "bootstrap"), 
                      smooth = FALSE, smoothFunction = locfitByCluster, useWeights = FALSE, 
                      B = ncol(permutations), permutations = NULL, verbose = TRUE, 
                      ...) 
  {
    nullMethod <- match.arg(nullMethod)
    if (missing(design)) 
      stop("design must be specified")
    if (missing(pos)) 
      stop("If object is a matrix, pos must be specified")
    bumphunterEngine(object, design = design, chr = chr, 
                     pos, cluster = cluster, coef = coef, cutoff = cutoff, 
                     pickCutoff = pickCutoff, pickCutoffQ = pickCutoffQ, 
                     maxGap = maxGap, nullMethod = nullMethod, smooth = smooth, 
                     smoothFunction = smoothFunction, useWeights = useWeights, 
                     B = B, permutations = NULL, verbose = verbose, ...)
  }
  .local(object, ...)
}

So we find that it is, basically, a S4 method to call the bumphunterEngine function that does most of the job.

So, typing bumphunterEngine you get:

function (mat, design, chr = NULL, pos, cluster = NULL, coef = 2, 
          cutoff = NULL, pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500, 
          nullMethod = c("permutation", "bootstrap"), smooth = FALSE, 
          smoothFunction = locfitByCluster, useWeights = FALSE, B = ncol(permutations), 
          permutations = NULL, verbose = TRUE, ...) 
{
  ...

  if (useWeights & smooth) {
    tmp <- .getEstimate(mat = mat, design = design, coef = coef, 
                        full = TRUE)
    rawBeta <- tmp$coef
    weights <- 1/tmp$sigma
    rm(tmp)
  }
  else {
    rawBeta <- .getEstimate(mat = mat, design = design, coef = coef, 
                            full = FALSE)
    weights <- NULL
  }

  ...

  else {
    beta <- rawBeta
    Index <- seq(along = beta)
  }

  ...

  comp <- computation.tots2(tab = tab, A = A)
  rate2 <- comp$rate2
  pvalues2 <- comp$pvalues2
  tab$p.value <- pvalues1
  tab$fwer <- rate1
  tab$p.valueArea <- pvalues2
  tab$fwerArea <- rate2
  tab <- tab[order(tab$fwer, -tab$area), ]
  algorithm <- list(version = packageDescription("bumphunter")$Version, 
                    coef = coef, cutoff = cutoff, pickCutoff = pickCutoff, 
                    pickCutoffQ = pickCutoffQ, smooth = smooth, maxGap = maxGap, 
                    nullMethod = nullMethod, B = B, permutations = permutations, 
                    useWeights = useWeights, smoothFunction = deparse(substitute(smoothFunction)))
  ret <- list(table = tab, coef = rawBeta, fitted = beta, pvaluesMarginal = pvs, 
              null = list(value = V, length = L), algorithm = algorithm)
  class(ret) <- "bumps"
  return(ret)
}

It is a long function, but the relevant parts for this problem IMO are coef = 2, in the parameters of the function, and .getEstimate(mat = mat, design = design, coef = coef, ...) which calculates the coefficients.

Regarding coef = 2, we can see that the original bumphunter function already has it defined as defaults. Also, knowing that bumhunter was originally defined to compare 2 groups I went to check the minfi wrapper to see if they implemented a change.

setMethod(
  "bumphunter",
  signature(object = "GenomicRatioSet"),
  function(object, design, cluster = NULL, coef = 2, cutoff = NULL,
           pickCutoff = FALSE, pickCutoffQ = 0.99, maxGap = 500,
           nullMethod = c("permutation", "bootstrap"), smooth = FALSE,
           smoothFunction = locfitByCluster, useWeights = FALSE,
           B = ncol(permutations), permutations = NULL, verbose = TRUE,
           type = c("Beta","M"), ...) {

    .isMatrixBackedOrStop(object, "bumphunter")

    type <- match.arg(type)
    bumphunterEngine(
      mat = getMethSignal(object, type),
      design = design,
      chr = as.factor(seqnames(object)),
      pos = start(object),
      cluster = cluster,
      coef = coef,
      cutoff = cutoff,
      pickCutoff = pickCutoff,
      pickCutoffQ = pickCutoffQ,
      maxGap = maxGap,
      nullMethod = nullMethod,
      smooth = smooth,
      smoothFunction = smoothFunction,
      useWeights = useWeights,
      B = B,
      permutations = permutations,
      verbose = verbose,
      ...)
  }
)

We see that the minfi wrapper basically fits the right matrix of data (mat = getMethSignal(object, type)) to dhe bumphunterEngine function keeping the coef = 2 option.

At this point, I went to check the other function .getEstimate(mat = mat, design = design, coef = coef, ...) which calculates the coefficients, if I understood properly, based on the data matrix (mat = mat), the experimental design (design = design) and the coefficient (coef = coef, which gets set to 2). The code can be found here.

.getEstimate <- function(mat, design, coef, B=NULL, permutations=NULL,
                         full=FALSE) {
  v <- design[,coef]
  A <- design[,-coef, drop=FALSE]
  qa <- qr(A)
  S <- diag(nrow(A)) - tcrossprod(qr.Q(qa))

  vv <- if (is.null(B)) matrix(v, ncol=1) else{
    if(is.null(permutations)){
      replicate(B, sample(v))
    } else{
      apply(permutations,2,function(i) v[i])
    }
  }


  sv <- S %*% vv
  vsv <- diag(crossprod(vv,sv))

  b <- (mat %*% crossprod(S, vv)) / vsv
  if(!is.matrix(b))
    b <- matrix(b, ncol = 1)
  if (full) {
    sy <- mat %*% S
    df.residual <- ncol(mat) - qa$rank - 1
    if (is.null(B)) {
      sigma <- matrix(sqrt(rowSums((sy - tcrossprod(b, sv))^2) / df.residual), ncol=1)
    } else {
      sigma <- b
      tmp <- sy
      for (j in 1:B) {
        tmp <- tcrossprod(b[,j], sv[,j])
        sigma[,j] <- rowSums((sy-tmp)^2)
      }
      sigma <- sqrt(sigma/df.residual)
    }
    out <- list(coef=b,
                sigma=sigma,
                stdev.unscaled=sqrt(1/vsv),
                df.residual=df.residual)
    if (is.null(B)) out$stdev <- as.numeric(out$stdev)
  } else {
    out <- b
  }
  return(out)
}

and my design would be something like:

des.groups <- c(rep("control", 2), rep("drug.A", 2), rep('drug.B', 2))
design <- model.matrix(~ des.groups)
design

  (Intercept) des.groupsdrug.A des.groupsdrug.B
1           1                0                0
2           1                0                0
3           1                1                0
4           1                1                0
5           1                0                1
6           1                0                1

I am worried that, just at the beggining of the .getEstimate function, it subsets the design based on the coef argument

v <- design[,coef]

getting

> v
1 2 3 4 5 6 
0 0 1 1 0 0

However this is as far as I can go with my linear algebra knowledge. So I am wondering if the value and area returned by bumphunter reflect the mehylation changes of "just" 2 groups (in my case control (Intercept) and des.groupsdrug.A).

What do you think? If this is the case, how would you address the comparison among 3 groups? with 3 pairwise comparisons?

If you arrived here, THANKS for reading the long post and please share your thoughts.

methylation minfi EPIC array bumphunter DMR • 2.8k views

ADD COMMENT • link updated 6.0 years ago by Biostar 20 • written 6.6 years ago by alfonsosaera ▴ 50