[R] Same Analysis but different results ?
Entering edit mode
6 weeks ago
ali ▴ 20

Hi everyone. Im doing an analysis (DGE) on a count matrix (samples as columns and genes as rows). I have 7 classes ( 6 tumor sample classes and 1 healthy control). I need to do single tumor vs healthy control for each tumor. In order to go faster and avoid to do everything for each class I have done the fit of all the count matrix (DGEList) and then selected the specific contrast (6 contrasts in total).

   #y is the DGEList with the count matrix with all classes so the design has 7 columns
    fit.multi.class <- glmFit(y, design)

    # build the contrast 
    contrast <- makeContrasts(
      BVsH=Breast - HC, #Breast - Healthy Control
      CVsH=CRC - HC, #CRC - Healthy Control
      GVsH=GBM - HC, #GBM - Healthy Control
      HVsH=Hepatobiliary - HC, #Liver - Healthy Control
      LVsH=Lung - HC, #Lung - Healthy Control
      PVsH=Pancreas - HC, #Pancreas - Healthy Control

    # Breast vs HC
    lrt.brca.hc <- glmLRT(fit.multi.class, contrast = contrast[,"BVsH"])

    # CRC vs HC
    lrt.crc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"CVsH"])

    # GBM vs HC
    lrt.gbm.hc <- glmLRT(fit.multi.class, contrast = contras[,"GVsH"])

    # HBC vs HC
    lrt.hbc.hc <- glmLRT(fit.multi.class, contrast = contrast[,"HVsH"])

    # NCSLC vs HC
    lrt.lung.hc <- glmLRT(fit.multi.class, contrast = contrast[,"LVsH"])

    # Pancreas vs HC
    lrt.paad.hc <- glmLRT(fit.multi.class, contrast = contrast[,"PVsH"])

Everything works well except a big problem. In order to test if this method was actually really giving me the single contrasts I asked , I decided to test it on GBM Vs HC. So As shown below I obtained a sub-matrix with only GBM and HC samples, and then applied the contrast and the results are different! How is possible , I thought the operation I made above would give me the same PValue/FDR , I know that above I did a "multi-class" fit but I thought that the contrast was exactly for giving me PValue for each single contrast. I dont know now which should I use for my analysis.

#y is the count matrix with only GBM and HC samples
fit <- glmFit(y, design)

# build the contrast 
contrast <- makeContrasts(GBM - HC,levels=colnames(design))

# perform likelihood ratio test for differential expression
lrt <- glmLRT(fit,contrast = contrast)

lrt in the end comes out different from lrt.gbm.hc computed before in terms of PValues/FDR.

R edgeR GEO • 249 views
Entering edit mode
6 weeks ago

Let me see if I understand the problem correctly:

  • you start with a bit counts matrix containing data for several diseases
  • first, you fit glmLRT on the whole matrix, then extract the contrast for GBM - HC
  • in parallel, you subset this matrix to get only data for GBM and HC, then fit the model GBM - HC
  • you observed different results for the two models (how much different?)

In principle yes, this is perfectly normal and expected, and you should preferentially use your first analysis with the big counts matrix, although the second analysis is not wrong either.

I am not an edgeR expert, but I see that glmLRT fits an ANOVA on the coefficients to see if any of them are different from zero. In general, with ANOVA, more data allows modeling the variance more accurately, even if it is data from subgroups that you are not interested in extracting in your contrasts. In layman's words, fitting the model on a bigger dataset will improve the way variance is modeled and make your predictions more accurate. Your model will have more information on the expected variance of each gene across all samples, and this will allow a better estimate on whether the difference observed between GBM and HC is significant or within the expected variance for the gene.

There is one thing to check on your code, though. Could you paste the code you used you create the sub-matrix with only GBM and HC samples? Just to check if it has been done properly, sometimes R's syntax for subsetting is tricky.

Entering edit mode

First of all thank you for your reply , it is really clear and well explained. Your understanding is correct. The result in terms of classification are better with the first method indeed , I was just scared was not right. The differences are not to much , for instance the top genes are basically the same , but this little differences brings in the end some genes that are not in both. Here is the code for the sub-matrix :

# selecting tumor of interest
tumor <- "GBM"
#sample.cond has all names of samples with its respective class name
sample.cond <- sample.cond[sample.cond$class == tumor |sample.cond$class == "HC" ,]
counts <- counts[,rownames(sample.cond)]

group <- factor(sample.cond$class) #Factor of 2  GBM HC 

y <- DGEList(counts = counts, group = group)

Last question , I do the normalization of the count matrix before actually do the fit (method = "TMM") and of course doing the normalization on the full matrix (with all classes) produce slightly different results than doing it on the sub-matrix (2 classes) , this is normal right?

Entering edit mode

NP! Your code for subsetting looks correct to me. It makes sense that the differences are not too much. The genes that appear significant in the second method but not in the first may have high variance. For the TMM normalization, I think it should be normal because TMM uses a scaling factor computed on the mean between each pair of samples


Login before adding your answer.

Traffic: 1262 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6