Question: Analysis with no Differentially Expressed Genes?
1
gravatar for vitor.saldanha11
10 months ago by
vitor.saldanha1110 wrote:

Hi Guys, I'm running my analysis for differential expression and the output give me no differentially expressed genes. Is there something wrong with my code or that's the true output of my samples?

First I imported the phenotype data:

phenoData <- read.AnnotatedDataFrame("design.txt", row.names = NULL, sep = "\t")

An object of class 'AnnotatedDataFrame' rowNames: 1 2 ... 6 (6 total) varLabels: Samples Treatment varMetadata: labelDescription

Then I ran the RMA:

eset <- justRMA("CN1_L10_T1.CEL","CN2_L10_T1.CEL","CN3_L10_T1.CEL",
            "Ti50_1_L10_T1.CEL", "Ti50_2_L10_T1.CEL", "Ti50_3_L10_T1.CEL",
            phenoData=phenoData)

I constructed the model matrix:

sample <- factor(rep(c("Cont", "Ti50"), each=3))
design <- model.matrix(~0 + sample)
colnames(design) <- levels(sample)
design

Cont Ti50
1    1    0
2    1    0
3    1    0
4    0    1
5    0    1
6    0    1

attr(,"assign") [1] 1 1 attr(,"contrasts") attr(,"contrasts")$sample [1] "contr.treatment"

I constructed the contrast matrix:

contrasts <- makeContrasts(Diff = Cont - Ti50, levels = design)
contrasts

Contrasts Levels Diff Cont 1 Ti50 -1

Then I ran the analysis:

fit <- lmFit(eset, design)  
fit2 <- contrasts.fit(fit, contrasts)
efit <- eBayes(fit2)        
deg <- topTable(efit, coef="Diff", p.value = 0.05, lfc = log2(1.5), adjust.method = "fdr", number = nrow(eset))   
deg

data frame with 0 columns and 0 rows

I noticed that if I put p-value=1 I have 500 degs. So I investigated:

range(deg$adj.P.Val)

0.7866448 0.8526774

Indeed the threshold cuts all my genes. That's the true output ( no genes differentially expressed)? Or I did something wrong?

ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 10 months ago by vitor.saldanha1110
1

Did you inspect the MDS plot (plotMDS() function from limma)? Could you post the image here?

ADD REPLYlink modified 10 months ago • written 10 months ago by h.mon24k
1

Though you have used established and well used topTable function, Limma developers discourage such practice. Copy/pasted from their manual:

Although topTable enables users to set p-value and lfc cutoffs simultaneously, this is not generally recommended. If the fold changes and p-values are not highly correlated, then the use of a fold change cutoff can increase the false discovery rate above the nominal level. Users wanting to use fold change thresholding are usually recommended to use treat and topTreat instead.

topTable does double filtering (by lfc and p-value(. There are/were considerable discussions among bioinformatics communities regarding this. To address this issue, limma recommends treat and topTreat method in case of filtering by fold change.

ADD REPLYlink written 10 months ago by cpad011211k
1

Just one thing to follow from this: For the lfc parameter to topTable(), by specifying log2(1.5), you are actually setting the threshold to absolute 0.58496250072 (log2(1.5)=0.58496250072).

Also, as I've implied in my own answer, I seriously doubt the utility of doing a differential expression analysis using just 6 samples. If this is pilot data for a grant, then fair enough, but even then you could have aimed for a higher sample n. I'm not even sure that you should be looking or p-values.

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k

In one of the papers on fewer chips (3 was the least authors took) on pooled samples, conclusion was that for pooled samples, fewer chip numbers are on par with higher chip numbers with non pooled samples. I am not sure if it is general observation. However OP seems to have p and adjusted p on high side. I think OP should throw more light on top treat results and see if they are reported the same way in OP area of research

ADD REPLYlink written 10 months ago by cpad011211k

Yes indeed, you are the master cpad0112. Unfortunately the user may never come back to provide further info, which means that we can all just speculate.

Edit: I should add: good luck publishing in any reputable journal with just 3 samples going into a differential expression analysis.

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k

As much as you are knowledgeable, you are also being generous to others. I guess most of the recent posts are from beginners and do not read instructions laid by admins

ADD REPLYlink written 10 months ago by cpad011211k

No, seriously, your coding skills in R are pretty great. Your point on pooled and non-pooled also makes sense. >30 years ago, one could probably publish a microarray study in Nature on a few samples - I'm sure they have.

Back in the day (18 years ago): a microarray study on <100 samples that revolutionised breast cancer forever: https://www.ncbi.nlm.nih.gov/pubmed/10963602

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k

In one of the papers on fewer chips (3 was the least authors took) on pooled samples

Could you provide the reference?

ADD REPLYlink written 10 months ago by h.mon24k

I have high confidence that the issue is your low sample numbers. With a 3 versus 3 comparison and total sample n=6, it was always going to be difficult to identify genuine DEGs. If it were imbalanced, e.g., 3 versus 1, then you may have seen DEGs, but they would have been unreliable calls.

Edit: I neither encourage the conducting of a differential expression analysis using such low sample numbers.

The best that you can consider for this small study is to just look at ratios, i.e., ratio of each gene in Ti50 compared to Cont, and then order them by the highest to lowest ratio. I would forget about trying to obtain a p-value.

For each gene (pseudocode): mean(Ti50) / mean(cont)

----------------------------------------------------

Another thing that you could try is to divide your dataset into cont and Ti50, transform each to Z-scale, and then look at the genes highly and lowly expressed in each group by using a cut off of absolute Z>3 (i.e. 3 standard deviations away from the mean).

Just some suggestions.

Kevin

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k

Do you have references to back up your claim that nothing will ever come out of a microarray-based transcriptomics study comparing 3 vs. 3? The breast cancer study you've referenced my have had a very different goal than vitor.saldanha, which is why I find it difficult to embrace your general comment without a benchmarking reference.

I've mostly dealt with RNA-seq data where the current recommendation is 6 samples per condition (Schurch et al.), but most researchers still get by and get published with 3 replicates per condition. And in the case of well-controlled model organisms such as mice, worms etc. I don't think one would expect to see absolutely nothing if the experiment was well designed.

ADD REPLYlink written 10 months ago by Friederike3.2k

Hello Friederike. You may be misinterpreting / misrepresenting what I said. It is clear that one can identify differentially expressed genes from a 3 vs 3 design (even I have done it); however, I question the utility and broader applicability of doing such a thing. In a study where you have a heterogeneous sample mix, then one will struggle; however, in a gene knockdown study, one will easily identify DEGs from such a design.

Have a nice afternoon.

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k

I did not mean to misrepresent what you said, thanks for clearing that up. I mostly wanted to point out that there might be reasons for why the OP is not detecting DEG which might be worth exploring. The replies you gave left me with the impression of generally discouraging these types of analyses when one had only 3 replicates -- when a lot of research efforts have gone into addressing the problems of calculating p- and q-values based on low sample numbers. I agree, one should not draw up new treatment guidelines for breast cancer patients based on a single 3 vs 3 comparison, but I'm almost certain that's not what the OP had in mind with this analysis.

ADD REPLYlink modified 10 months ago • written 10 months ago by Friederike3.2k

Yep, well, the OP has never responded to anything to clarify, but has been logging in to read the comments. So, we're all just speculating about her/his study design. Technically I'd say that every comment answer is correct when applied to a certain situation. The OP may only have upvoted and accepted the answer that s/he wanted to hear.

¯_(ツ)_/¯

ADD REPLYlink written 10 months ago by Kevin Blighe39k

I second h.mon's comment: what does your data look like in the global analyses such as PCA or MDS? If you do not see a clear grouping that follows the design of your experiment, most likely you're not going to get any DE genes. The main task should then be to find out whether that's due to noise or to possibly confounding factors.

ADD REPLYlink written 10 months ago by Friederike3.2k
3
gravatar for Charles Warden
10 months ago by
Charles Warden6.4k
Duarte, CA
Charles Warden6.4k wrote:

It is definitely possible to have genes identified with a 3 x 3 comparisons, but I'm not sure of how clear the Ti50 effect is.

I would usually expect limma to be a little more sensitive than the built-in R functions (like aov() for ANOVA and lm() for linear-regression), although some of those pre-processing functions aren't exactly the same as I as expecting.

Perhaps you could try testing some functions this this demo dataset / tutorial?

https://www.bioconductor.org/help/course-materials/2005/BioC2005/labs/lab01/estrogen/

I don't believe I've followed this exact tutorial, but I would expect this dataset to have some genes identified with a low FDR (and usually an example dataset with clear expression changes would be a good idea for initially testing code).

ADD COMMENTlink written 10 months ago by Charles Warden6.4k

For the oestrogen data, another tutorial to follow by Gentleman and Huber: http://www.bioconductor.org/packages//2.12/data/experiment/vignettes/estrogen/inst/doc/estrogen.pdf

In it, they stay clear of trying to define p-values via the standard limma method and instead focus on effect sizes and variation, which is more appropriate for low sample n. Oestrogen is such a potent hormone, however, that one would indeed see differences with just a 3 versus 3 study design. For other more general studies, as mentioned in my answer, one will struggle to get useful stats from such a low sample n.

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2494 users visited in the last hour