Question

Differential Expression Help

5.0 years ago

mlai2567 • 0

Hi all,

I'm trying to analyze an RNA-seq dataset that was just run. It consists of 3 conditions (Control, Treatment 1, Treatment 2) with 2 biological replicates (Cell line #1, Cell line #2) each. However, because the biological replicates are from cell lines of different passage number, there is variation in the base level of gene expression, even within the controls, making it difficult to conduct differential expression analysis across the replicates. For example, the expression value for the control of Cell line #1 may be 1, but the control of Cell line #2 may be 2. The expression values of the 2 treatments vary accordingly, so edgeR reports only a small number of differentially expressed genes with FDR<0.05.

Is there a way to measure differential expression within each cell line individually? For example, is there a way to see which genes are differentially expressed between treatment 1 and the control for cell line #1, and then see whether these genes are also differentially expressed between treatment 1 and the control for cell line #2?

Thanks in advance.

Cheers, Michael

RNA-Seq Differential Expression edgeR • 1.3k views

ADD COMMENT • link updated 5.0 years ago by Biostar 20 • written 5.0 years ago by mlai2567 • 0

Any reason you don't want to just run the analysis twice (once for each cell line), and then look for overlap between the two?

ADD REPLY • link 5.0 years ago by jared.andrews07 ★ 16k

That was an option I've been considering. However, when I do that, the analysis runs as if there were no replicates, and I wasn't sure whether that is acceptable.

ADD REPLY • link 5.0 years ago by mlai2567 • 0

It is not. Still, n=2 is not great either. The point of replication is to accurately estimate the variance of each gene in order to properly distinguish technical and intra-group variation from inter-group variation (= the biological effect). Your experiment is most likely (and simply) underpowered in terms of replicate numbers. The fact that you use different passage numbers only makes things more complicated, as it adds additional technical variation. Without knowing the technical details, a better design would probably have been to take the exact same cell line, subculture, say, three batches independently for a certain time, and then use these three batches for an n=3 experiment. It is of course easy to say that, but money, time and effort make things complicated, I am aware of that ;-) Still, statistics unfortunately does not care about the reason why an experiment is underpowered. The number of replicates you need depends on the effect sizes you expect and want to identify as significant. Smaller effect sizes require more replication than larger ones.

ADD REPLY • link 5.0 years ago by ATpoint 81k

Ah, I see. Ideally, you'd have a few technical replicates for each cell line, but I know it's not always possible. Regardless, you can adjust your GLM to account for differences between the cell lines and only try to identify changes due to the treatments.

You can read section 4.2 of edgeR's user guide for an example (they adjust for batch effects, but the method is the same). It's just a simple adjustment to your design matrix and sample groupings.
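To make the adjustment concrete, here is a minimal Python sketch of what an additive design matrix (the formula ~ cell_line + treatment) looks like for this experiment. The sample labels and the helper `additive_design` are made up for illustration and are not part of edgeR; in edgeR itself the matrix would be built in R and passed to the GLM fitting functions.

```python
# Hypothetical sample sheet: 3 conditions x 2 cell lines, as described in the question.
samples = [
    ("line1", "control"), ("line1", "treat1"), ("line1", "treat2"),
    ("line2", "control"), ("line2", "treat1"), ("line2", "treat2"),
]


def additive_design(samples, ref_line="line1", ref_treatment="control"):
    """Build a treatment-coded design matrix for ~ cell_line + treatment.

    The reference cell line and the control are absorbed into the intercept;
    every other level gets its own 0/1 indicator column.
    """
    lines = sorted({line for line, _ in samples if line != ref_line})
    treatments = sorted({trt for _, trt in samples if trt != ref_treatment})
    columns = ["intercept"] + lines + treatments
    rows = []
    for line, treatment in samples:
        row = [1]
        row += [1 if line == other else 0 for other in lines]
        row += [1 if treatment == other else 0 for other in treatments]
        rows.append(row)
    return columns, rows


columns, design = additive_design(samples)
print(columns)
for (line, treatment), row in zip(samples, design):
    print(line, treatment, row)
```

The "line2" column soaks up the baseline difference between the two cell lines, so the "treat1" and "treat2" coefficients describe the treatment effect within a cell line. With 6 samples and 4 coefficients, only 2 residual degrees of freedom remain for estimating variance, which is why power stays low.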

ADD REPLY • link 5.0 years ago by jared.andrews07 ★ 16k

Also as a note, and to add to what @ATpoint rightfully mentioned, you shouldn't expect all that many traditionally "significant" results with only 2 samples per condition. The changes would have to be very consistent and quite strong, so you may have to rely on something like a fold change cutoff. Reading the "I don't have any replicates" section of the edgeR manual (easily found in the table of contents) would likely do you good as well, and it explains pretty much all of the options available to you.

ADD REPLY • link 5.0 years ago by jared.andrews07 ★ 16k

Thank you both for your responses! I understand that the experimental setup was not ideal given my limited resources, and will hopefully be able to perform additional replicates for each condition in the future. However, because I will not be able to sequence additional samples for some time I will proceed with the analysis of the data regardless.

As of right now, I think my best bet is to perform edgeR analysis with a focus on fold change. I've set the parameters as log fold change >1.5 and FDR<1 (essentially making significance negligible?). I'm currently using Galaxy to perform these analyses, as most of my training is as a molecular biologist and I have relatively limited programming knowledge. Does this fold change cutoff of greater than 1.5 seem like my best option given my situation?

I was also considering the program GFold (https://zhanglab.tongji.edu.cn/softwares/GFOLD/index.html) as I had seen that it allowed for differential analyses to be run without replicates; however, I feel that I lack the programming knowledge necessary to run the algorithm. Would something like this program be suitable if I were to run analyses on each cell line separately and subsequently compare the results?

ADD REPLY • link 5.0 years ago by mlai2567 • 0

You can of course technically do that, but you will get a lot of false positives with log2FC > 1.5 and FDR < 1. Genes with small counts will have artificially high fold changes (mean-variance dependency). If you want to relax the cut-offs, at least use something like an FDR of 10 or 15%, but not more. I am not too familiar with edgeR, but DESeq2 offers the possibility to use shrinkage estimators to more accurately estimate the true log2FCs. This will get rid of high fold changes based on small counts. Maybe this approach with a relaxed FDR of 15% will get you some candidates you can carefully proceed with, ideally validated by qPCR.
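As a rough illustration of the small-count problem, here is a short Python sketch. The counts are invented, library sizes are assumed equal, and the added prior count is only a simple stand-in for the idea of damping fold changes; it is not the shrinkage estimator that DESeq2 actually uses.

```python
import math


def log2_fold_change(treated, control, prior_count=0.0):
    """log2 ratio of two counts, optionally damped by adding a prior count to both."""
    return math.log2((treated + prior_count) / (control + prior_count))


# Hypothetical counts for two genes (treated, control), equal library sizes assumed.
low_gene = (4, 1)          # 3 extra reads look like a 4-fold change
high_gene = (1400, 1000)   # 400 extra reads are only a 1.4-fold change

raw_low = log2_fold_change(*low_gene)
raw_high = log2_fold_change(*high_gene)

# Adding a prior count of 5 (an arbitrary illustrative value) to both groups
# pulls the low-count estimate towards 0 and barely moves the high-count one.
damped_low = log2_fold_change(*low_gene, prior_count=5)
damped_high = log2_fold_change(*high_gene, prior_count=5)

print(f"low counts:  raw={raw_low:.2f} damped={damped_low:.2f}")
print(f"high counts: raw={raw_high:.2f} damped={damped_high:.2f}")
```

The low-count gene passes a log2FC > 1.5 filter on its raw value even though it rests on a handful of reads, and it drops well below the cutoff once damped, while the high-count gene is almost unchanged.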

ADD REPLY • link 5.0 years ago by ATpoint 81k