Question

What is Scheffé's test, and can post-hoc tests tell me in which time points a differential peak, from time series DNAse-seq data, had higher signal?

2

Entering edit mode

8.9 years ago

ejliaw ▴ 30

I have DNase-seq data from 7 time points, 2 replicates for each time point (14 samples total). For each DNase-seq peak, I want to determine in which subset of time points the signal is significantly increased compared to the complement set of other time points. A related question would be to determine in each time point whether each peak is present or absent. I could answer this by intersecting the lists of called peaks in each sample. However, I've noticed (by manual observation) that it is common for peak callers to call peaks in only in some samples even though an increase in signal vs. nearby background can be seen in all samples. Thus, I've chosen the wording in the first paragraph.

I've been acquainting myself with how edgeR/CSAW/DiffBind use linear regression & GLMs to find differential next-gen peaks. From the Internet, I have come up with the following workflow:

Create a union peak list by merging peaks called from 7 BAMs (each merged from 2 replicates) and get the counts of reads falling on each peak from the 14 samples.
In edgeR, treat each time point as a group and use an ANOVA-like test to determine which peaks are differential in some way. (Or replace #1 and #2 with CSAW. Lun ATL and Smyth GK (2014). Nucleic Acids Res., 42(11), pp. e95.)
To each peak passing step #2, apply post-hoc tests to find the contrast that best separates my time points.

I am unsure about step #3. In my case, the factors are individual time points, but I hypothesize that the time points are grouped into larger groups or stages (which I'd like to discover). From my limited understanding of design of experiments, it seems like Tukey's HSD doesn't apply to me because I am interested in more than pairwise comparisons between time points. The time points are not really treatments; it's the larger stages that interest me. Would Scheffé's test be the correct approach since it is "testing all possible contrasts"? http://www.biochemia-medica.com/2011/21/203

Follow up thought: I now understand Scheffé's test to be a way to correct p-values after multiple hypotheses testing. In my case I am using post-hoc tests to find which partitioning of time points created the "treatment" difference that passed ANOVA, not really in how significant that partitioning is. Is p-value correction needed? Is it possible that for a peak that ANOVA found to be differential, no contrast passes Scheffé's test? Can I just apply all contrasts I'm interested in and choose the one with the lowest (uncorrected) p-value as the way to decide which time points are "high" in step #3 (see the pseudocode below)?

library('edgeR')
...
DGE.fit <- glmFit(DGE.obj, my.design.matrix) # No intercept term in design matrix
# ANOVA-like test
anova.results <- glmLRT(DGE.fit, contrast=c("D1-D2", "D1-D3", ..., "D1-D7"))

# Get rows that pass ANOVA-like test
anova.rows <- ( p.adjust(anova.results$table%PValue, method="BH") < 0.05 )

# create 2^(7-1)-1 contrasts*: (1,-1/6,-1/6,...,-1/6), (-1/6,1,-1/6,...,-1/6), ..., (1/2,1/2,-1/5,-1/5,..,-1/5), etc.
# *Number of ways to divide 7 time points into 2 non-empty partitions = 63.
all.contrasts <- ...a block of code...

# For loop over running glmLRT with each contrast.
glm.results <- apply(X=all.contrasts, FUN=function(c){glmLRT(DGE.fit[anova.rows, ], contrast=c)}, MAR=2)

# For each row/peak, find which column/contrast has the lowest p-value
glm.results.matrix <- do.call(cbind, lapply(X=subset.res, FUN=function(x){return(x$table$PValue)}))
which.contrast.best <- apply(glm.results.matrix, FUN=which.min, MAR=1)

Lastly, is there a better way to answer my question than brute force linear regression?

Can I do some arithmetic on the 7 signal averages at each peak to bin them into a high and a low bins? I've gotten the suggestion to use apply a cutoff to the DNase signal to determine when peaks have higher signal, but I don't know how to set a cut-off that incorporates locus-specific information about background or the mean signal/variance, which seems necessary and which csaw/edgeR appears to offer.

Thanks for reading this long question

Eric

next-gen MARS time-series DNAse-seq • 3.9k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.9 years ago by ejliaw ▴ 30

1

Entering edit mode

The question of which binary up-down pattern best fits my time-series data was explored with StepMiner in Sahoo et al., "Extracting binary signals from microarray time-course data", Nucleic Acid Research (2006) 35 (11):3705-3712. http://nar.oxfordjournals.org/content/35/11/3705.long

That paper used "adaptive regression" to fit 4 kinds of binary patterns, but I'm also interested in all the other possible patterns rather than lumping them into an "other" category. Is anyone familiar with adaptive regression who can tell me how to adapt the StepMiner algorithm to my situation? Or more reasonably, do you know of books, articles, R packages w/ tutorials where I can learn about adaptive regression, or of other published methods similar to StepMiner?

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by ejliaw ▴ 30