Advice on RNASeq analysis without biological replicates for differential gene expression.
5
6
Entering edit mode
9.4 years ago

Hi everybody,

I've been lurking for a month learning about RNASeq analysis but now it's time to post my first question!

Very briefly I have the following situation that I would love to have some insight about:

Setup

RNAseq (HiSeq, single reads, 50bp) on 4 cell lines, untreated or treated.

High-quality 44bp reads after fastqc filtering and trimming, 20-27 million mapped reads on human, tophat2-mapped using hg38.gtf strict.

Problem

We originally assumed that the cell lines would be sufficiently similar to each other to be considered "biological replicates" and we would thus see the effects of the treatment on the transcriptome. However, this turned out to be a wrong assumption and they are quite different globally, so the effects of the treatment are far smaller than being different cell lines (I've seen this by clustering normalized reads (from HTSeq) or FPKM values from Cufflinks, and running a PCA or doing heatmaps).

Question

What's the best way to analyze RNASeq without biological replicates?

So far I've tried

Cuffdiff using -blind mode, doesn't seem to yield anything different than default setting.

DESeq2, seems to work a bit better but severely underestimates changes (that we already saw by QPCR on individual genes).

So I was wondering if you think it's possible / "allowed" to do fold-difference values on normalized reads (HTSeq and using sizeFactors from DESeq2) pair-wise of untreated and treated of the 4 samples but what statistics can I use for that to determine significant changes? Or is it better to use FPKM values from Cufflinks, since they already come with a built-in confidence interval and simply calculate folds pairwise and cluster the results?

Grateful for any kind of insight or advice!

HTSeq Cuffdiff Cufflinks DESeq2 RNA-Seq • 20k views
ADD COMMENT
5
Entering edit mode
9.4 years ago

You might give GFold a try. It's about the best you can do with unreplicated data. Having said that, if you can redo your experiment and use different passages of the cell lines as biological replicates then you'd get the most reliable results.

Your other option might be DEGseq or another model that assumes Poisson variance. That tool is usually a terrible option, but for a cell-line it might not perform that badly (it'll probably over-estimate the number of differences, but it might not do so too badly).

ADD COMMENT
0
Entering edit mode

What you say about poisson distribution is interesting, because when I use Cuffdiff with --poisson-dispersion I do get a sizeable amount of differentially expressed genes (~5000) (compared with only 50 or so using default dispersion estimation).

So, you don't think anything can be gained by using reads or FPKM values (using a cutoff threshold obviously to minimize low expressed genes / noise) and just going for genes showing consistent up / dn regulation in the four different cell lines?

The reason I'm asking is that I've done this, and gene set enrichment analysis (GSEA) on the ups and downs seem as what we would expect from the biological question, but I am still not sure if my method is "kosher"..

I will take a look at GFold and DEGseq, thanks for the advice!

ADD REPLY
3
Entering edit mode

I'd forgotten that cuffdiff had the --poisson-dispersion option :) That should produce similar results to DEGseq. What you're doing now (GSEA on consistent fold changes) is probably a decent route given the data you have at the moment. Having said that, if you tried to publish the results and I were one of the reviewers then the paper would get immediately rejected (unless treating the cell-lines was somehow extremely difficult/expensive...and the bar for that would be relatively high).

Sequencing is pretty cheap these days and cell lines are generally not that hard to come by. Before spending too much time on a questionable analysis, you might as well just redo the experiment in a publishable form and use non-questionable methods to analyze the results. You might end up with the same results either way, but this way you won't have problems later on.

ADD REPLY
0
Entering edit mode

Thumbs up for this advice. I get cold chills down my spine every time I see someone try to claim something with <3 (no, that's not a heart) biological replicates.

ADD REPLY
0
Entering edit mode

So, I think obviously everybody realizes that 'more biological replicates' is better than 'less biological replicates'. However, could you or Devon or someone else ELI5 (explain like I'm 5 yrs old) why a method like GFOLD is not to be trusted? I've gone through the GFOLD paper where the authors do quite comprehensive comparisons between the then available methods (deseq, edger, cufflinks, fold change with offset etc) and their results on no-biological-replicates seem pretty solid to be discarded without a more qualified argument.

Also, I'd like to remind that in our setup we are analyzing the effects of treatment on 4 distinct, but different-enough cell lines, and we're interested in common/consistent effects among the 4.

ADD REPLY
0
Entering edit mode

You're interested in measuring differences between two biologically variable groups. It is simply impossible to accurately gauge biological variability. No one will ever conceive of a perfect method to do this. The examples used in the GFOLD paper are..."interesting". It should be noted that when a Poisson model (aka DEGseq) performs better than edgeR that the dataset is very atypical.

So trust GFOLD for as much as it's worth. If you're intent is only to find some candidates that will be undergoing thorough validation then GFOLD is probably OK.

ADD REPLY
0
Entering edit mode

I disagree. I would argue that William Stanley Gosset solved this problem about a hundred years ago.

It's a paired t-test.

ADD REPLY
1
Entering edit mode

And test what with it? That two samples are different from each other? They will be, but that's not interesting. A paired t-test solves a completely different problem.

ADD REPLY
0
Entering edit mode

I should add that I assume the paired T-test would be pairing treated and untreated within cell-line. If the cell lines are indeed actually quite different, then this won't work due to cell-line specific changes due to the treatment (you'd miss all of them). If there are a reasonable number of consistent changes due to treatment then that'd work, but one could also just use a regular GLM (~line+treatment). If the goal in using multiple cell-lines was to find more robust changes then that would work OK, assuming the FDR correction doesn't turn out to be too brutal. Of course if the goal is now to find within cell-line changes (as suggested in the title), then that won't help at all.

ADD REPLY
0
Entering edit mode

They are interested in what genes change expression in response to the treatment. To do that they used four biological replicates (different cell lines) but because the cell lines had so much variability their experiment was under powered to detect differences due to the treatment effect rather than the differences in cell line. That's a textbook example of when you use a paired t test.

Using different cell lines actually blocks against cell-line specific effects and gets at the what the treatment is doing. Those changes, if the treatment is actually doing something, should be robust. There isn't a problem with lack of replication. They are blocking on a wide range of biological variables by using different cell lines. That is how experiments should designed.

ADD REPLY
0
Entering edit mode

I agree with Devon. paired t-test would not be appropriate

ADD REPLY
2
Entering edit mode
9.4 years ago
Manvendra Singh ★ 2.2k

Any attempt to work without any replicates will lead to conclusions of very limited reliability.

It's written in DESeq package that

the mean is a good predictor for the variance. Hence, if a number of genes with similar expression level are compared between replicates, then their variation is expected of comparable magnitude. Once we accept this assumption, we may argue as follows: Given two samples from different conditions and a number of genes with comparable expression levels, of which we expect only a minority to be influenced by the condition, we may take the variance estimated from comparing their count rates across conditions as ersatz for a proper estimate of the variance across replicates. After all, we assume most genes to behave the same within replicates as across conditions, and hence, the estimated variance should not change too much due to the influence of the hopefully few differentially expressed genes.

ADD COMMENT
1
Entering edit mode
9.4 years ago
Michele Busby ★ 2.2k

I would try normalizing all the data by some reasonable factor (e.g. start with a comparison of the median read count in each sample) and then see if a paired t-test gives you a sensible result.

This is a different situation than having no replicates.

http://www.biostathandbook.com/pairedttest.html

This is a very standard approach.

ADD COMMENT
0
Entering edit mode

Thanks for the answer Michele, I'll try it.

In your opinion, which metric should I be comparing? Normalized read counts, TPM, FPKM?

Thanks!

ADD REPLY
0
Entering edit mode

For the statistic it matters most how you normalize your reads by the overall read depth. If you are calculating FPKM or RPKM you are de facto normalizing by the total read count.

The way DESeq and some other packages work is that they normalize by a scaling factor based on the median read count, which will, in outlier cases, be more robust than normalizing by total read count.

But using FPKM or RPKM is really nice because it is the output of packages like RSEM which can do some acrobatics and call transcript abundances. And you may not have an outlier case.

So I would start with the FPKM output by RSEM or something. Then from there it's fairly easy to see if the normalization by total reads is appropriate. Just take the FPKM value for each comparison and plot them with one on the X axis and its partner on the Y axis. You should be able to draw a line of best fit right through the middle and it should go with a slope of 1.

So do that for each pair and if it looks reasonable I'd say you are OK with FPKM.

If it doesn't work (for example if your data has a huge outlier in one sample) you'll have to do something more complex. But not much more complex. A scaling factor based on the median FPKM usually works OK.

So in a recent paper I did I tried both approaches and FPKM worked great for something like 96 out of the 98 samples. It's not like microarrays were.

For differential expression it doesn't matter if you normalize by length or not because you are just dividing both sides by the same number.

Don't forget to calculate your FDR or do a multiple testing correction.

ADD REPLY
1
Entering edit mode
7.4 years ago

Take a look to the NOISeq R package It is designed to compare two RNA-Seq experimental designs with no parametric assumptions, and no replicates are required

ADD COMMENT
1
Entering edit mode
7.4 years ago
sysbiocoder ▴ 180

Can give audic method a try ususally it was used for microarray based analysis, but is applicable for RNAseq also Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res 1997; 7:986–995.

I have deposited R-script for audic method in github. https://github.com/sysbiocoder/Audicmethod

ADD COMMENT

Login before adding your answer.

Traffic: 2517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6