How to perform multiple comparisons using edgeR?
2
0
Entering edit mode
23 months ago
silas008 ▴ 140

Hello, guys.

I did not understant one thing. I found some posts regarding this question but I couldn't find an answer for my question.

If I have 4 groups, each one with 3 replicates: 1,1,1,2,2,2,3,3,3,4,4,4, and I want to compare all of them against each other, how can I do that?

For exemplo:

x <- read.delim("table.csv",row.names="genes")
group <- factor(c(1,1,1,2,2,2))
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
et <- exactTest(y)


In this case, I am comparing groups 1 and 2. Them I would do that for the other comparisons (like group <- factor(c(3,3,3,4,4,4))). Can I do that this way?

Or, instead of that should I do something like this?:

x <- read.delim("table.csv",row.names="genes")
group <- factor(c(1,1,1,2,2,2,3,3,3,4,4,4))
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmQLFit(y,design)
qlf_quasi <- glmQLFTest(fit,coef=2:4)
topTags(qlf_quasi)


The problem is that I don't know exactly how to use coef for this kind of comparisons. coef=2:4 will compare all groups against the control group, but I do not have a control group, I want to compare everything (or in other cases I want to choose specific comparisons. How can I choose exactly the comparisons I like to do?

Other problem is that when I use a code like this the output is a single table with multiple comparisons, but just one p-value and one FDR. I need separeted p-values for each comparison (like in the pos-tests for ANOVA or something like this).

It would be great if you can help me.

Thanks

RNA-Seq edgeR • 3.4k views
0
Entering edit mode

Hello, Charles. Thanks again.

1
Entering edit mode
23 months ago

I think there is a way to do this in edgeR, but I would probably need to refresh my memory about what those headers look like (I also thought you were essentially combining those coefficients, but I want to be careful about telling you that for certain, and I'm guessing that you've checked what happens if you specify everything except the intercept).

Strictly speaking, I would probably use glmQLFit(y, design, robust=TRUE) if I wanted to use edgeR-robust, but I would normally use glmFit() and glmLRT() more frequently in edgeR. However, I think that will essentially be the same in terms of getting 1 p-value.

You don't specify a coefficient with DESeq2, so perhaps that is something else that you can consider?

If you have clear expression changes, you may even be able to use the R-base functions for an ANOVA on log2(FPKM + 0.1) values:

gene.aov = function(arr, var1)
{
fit = aov(as.numeric(arr) ~ var1)
result = summary(fit)
aov.pvalue = result[[1]][['Pr(>F)']][1]
}
...
genes.pvalue = apply(logFPKM.mat, 1, gene.aov, var1=group)

0
Entering edit mode

Hello, Charles. Thanks for helping. I am sorry for the late response, I was out of my city in a conference.

I think the best option in this case is to check if Deseq2 can do want I need. In the case that glmFit() and glmLRT() will also provide 1 p value.

Do you know exactly why to use exactTest() for each comparison is not a good idea?

Thank you very much

1
Entering edit mode

No problem - l also may not be able to sign onto Biostars each day, but hopefully somebody else can provide useful pointers (if needed).

DESeq2 doesn't use the same functions as edgeR or limma-voom. If you just run results() on your dds variable (the DESeqDataSetFromMatrix() object from the tutorials) with a defined contrast, I think you can do a Wald-test. Otherwise, I would typically use the DESeq() function where the test can either be "Wald" or "LRT," and then run results() to create a summary table. Strictly speaking, I think some details of the DESeq analysis change depending upon your sample size, but I think these are the main/common things that you can control for your differential expression test.

I would strongly recommend not running a comparison without replicates (even though you could do something like a Fisher's Exact Test on counts). However, the exactTest() function has a parameter for dispersion estimates, and you are specifying replicates for your comparison. So, this may not be as bad. I think what is most important is that you have some sort of expression (preferably independently calculated) to assess replicate clustering (and other QC metrics / visualization) for each project.

0
Entering edit mode

Hello

I think I will follow your first recommendation: glmFit() glmLRT(). edgeR support also suggest this one as a good option, not only for multiple comparisons but also for group2vsgroup1 comparisons. Since we will validate the best targets of the data by qPCR later, I think we can use it for more error control and less restrictions.

Thanks again. You are helping a lot

1
Entering edit mode

Ok - I'm not sure what is the relative effect of using glmQLFit() without robust=TRUE, and that probably varies between different datasets.

However, I now understand you are referring to those functions within edgeR (not DESeq2), and that makes more sense.

Also, in terms of your question about the exactTest(), Robinson et al. 2010 mention "differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data (Robinson and Smyth, 2008)". There is also brief mention of the 2008 paper in Robinson and Smyth 2007: "[briefly], the exact test works as follows. The same quantile adjustment that is used to adjust the tag counts to a common library size for estimation is used to construct the exact test. Using this pseudodata, we again use the fact that a sum of independent and identically distributed NB random variables is also NB. By conditioning on the total pseudosum (an NB random variable), we can calculate the probability of observing counts as or more extreme than what we observed, resulting in an exact P-value". So, if this doesn't capture biological variability between your replicates, then it is probably not preferable. However, I think this is a question that is probably best answered by the developer.

1
Entering edit mode

Thank you very much. I have used glmFit() glmLRT(). I think it is the best choice for my in this case.

Let me ask you a simple question. I know the normalization in edegR follows different rules than a simple normalization. But I can not understand something like this:

edgeR output

              logFC logCPM    LR       PValue    FDR
mmu-mir-667 10.3587 5.772   20.1057 7.3277E-06  0.0039


Raw counts

              c1   c2    c3    b1     b2    b3
mmu-mir-667   0   0    0    326     89     135


CPM counts

                  c1    c2   c3     b1         b2       b3
mmu-mir-667       0     0     0    184.1    40.9    91.5


The LogFC for this miRNA is 10.3587, according to edgeR. And LogCPM is 5.7727. I understand that I will not get the same result doing a normal recalculation because this is based on mglmOneGroup() function. But my question is how can edegR calculate the FoldChange if in this comparison the counts for all replicates in one group is iqual zero?

1
Entering edit mode

To be honest, I use an independently calculated expression value for the fold-change (and to visually inspect concordance for replicates and other QC assessments). In other words, I would ideally test edgeR / DESeq2 / limma-voom for each project, but I am only using those programs for the p-value (not the fold-change).

For edgeR, I think it would be best to ask the developer. If you export the pseudocounts from edgeR, you may be able to just add 1 to be able to do a log-transformation. I can see an 'offset.aug' variable in the logFC porition of the exactTest.R code, but the logFC part of the glmLRT() code in the glmfit.R file looks different. So, I don't think I am the best person to answer that question.

For the independently calculated fold-change, I would typically use log2(FPKM + 0.1) expression values and require that 50% of genes have FPKM values above 0.1 (although I may change that exact rounding factor - for example, if I want to focus on the more highly expressed genes).

If you have a rounding factor, that allows you to calculate a log2ratio (and it helps avoid getting large fold-change values for genes with low expression levels).

I would then calculate a linear fold-change from the log2ratio as 2^log2ratio (if the log2ratio > 0) or -2^(-log2ratio) (if log2ratio < 0).

1
Entering edit mode

Hey, Charles.

I will send I massage to the edgeR support. This is a little bit confuse to me.

Thank you very much again!!!

0
Entering edit mode

You are very welcome - I hope this helps!

1
Entering edit mode
23 months ago
ATpoint 49k

You can use contrasts instead of coefficients for multiple comparisons. Say you already have your glmQLFit (lets term it fit), the next step would be to define these contrasts, given you have a design of like ~0 + group:

CONTRASTS <- makeContrasts( Group1vs2 = group1-group2,
Group2vs3 = group2-group3,
levels = design )


Now test all contrasts, e.g. via a loop:

for (i in 1:ncol(CONTRASTS)){

current.glmQLFTest <- glmQLFTest(fit, contrast = CONTRASTS[,i])

(...code to save that current.glmQLFTest to a variable with the contrast name...)

}

0
Entering edit mode

Hello, AT point.

Thank you for helping.

What is the difference between using CONTRASTS followed by loop FOR and using different glmQLFTest() for each comparison (like group2 vs group1... group3 vs group2... etc)?

Thanks again

1
Entering edit mode

it is the same, the loop only saves you from typing each contrast individually but this is not required. Remove it if you want.

0
Entering edit mode

Excellent.

Thank you very much.