Question: Advice on RNASeq analysis without biological replicates for differential gene expression.
5
gravatar for Christos Gekas
4.9 years ago by
Spain, Barcelona, IMIM(PRBB)
Christos Gekas50 wrote:

Hi everybody, 

I've been lurking for a month learning about RNASeq analysis but now it's time to post my first question!

Very briefly I have the following situation that I would love to have some insight about: 

 

Setup: 

RNAseq (HiSeq, single reads, 50bp) on 4 cell lines, untreated or treated. 

High-quality 44bp reads after fastqc filtering and trimming, 20-27 million mapped reads on human, tophat2-mapped using hg38.gtf strict. 

 

Problem: 

We originally assumed that the cell lines would be sufficiently similar to each other to be considered "biological replicates" and we would thus see the effects of the treatment on the transcriptome. However, this turned out to be a wrong assumption and they are quite different globally, so the effects of the treatment are far smaller than being different cell lines (I've seen this by clustering normalized reads (from HTSeq) or FPKM values from Cufflinks, and running a PCA or doing heatmaps). 

 

Question: 

What's the best way to analyze RNASeq without biological replicates? 

So far I've tried 

Cuffdiff using "-blind" mode, doesn't seem to yield anything different than default setting. 

DESeq2, seems to work a bit better but severely underestimates changes (that we already saw by QPCR on individual genes).

So I was wondering if you think it's possible / "allowed" to do fold-difference values on normalized reads (HTSeq and using sizeFactors from DESeq2) pair-wise of untreated and treated of the 4 samples but what statistics can I use for that to deteremine significant changes? Or is it better to use FPKM values from Cufflinks, since they already come with a built-in confidence interval and simply calculate folds pairwise and cluster the results ? 

 

Grateful for any kind of insight or advice!

ADD COMMENTlink modified 2.9 years ago by sysbiocoder170 • written 4.9 years ago by Christos Gekas50
5
gravatar for Devon Ryan
4.9 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

You might give GFold a try. It's about the best you can do with unreplicated data. Having said that, if you can redo your experiment and use different passages of the cell lines as biological replicates then you'd get the most reliable results.

Your other option might be DEGseq or another model that assumes Poisson variance. That tool is usually a terrible option, but for a cell-line it might not perform that badly (it'll probably over-estimate the number of differences, but it might not do so too badly).

ADD COMMENTlink written 4.9 years ago by Devon Ryan92k

What you say about poisson distribution is interesting, because when I use Cuffdiff with --poisson-dispersion I do get a sizeable amount of differentially expressed genes (~5000) (compared with only 50 or so using default dispersion estimation).

 

So, you don't think anything can be gained by using reads or FPKM values (using a cutoff threshold obviously to minimize low expressed genes / noise)  and just going for genes showing consistent up / dn regulation in the four different cell lines? 

The reason I'm asking is that I've done this, and gene set enrichment analysis (GSEA) on the ups and downs seem as what we would expect from the biological question, but I am still not sure if my method is "kosher".. 

 

I will take a look at GFold and DEGseq, thanks for the advice!

ADD REPLYlink written 4.9 years ago by Christos Gekas50
3

I'd forgotten that cuffdiff had the --poisson-dispersion option :) That should produce similar results to DEGseq. What you're doing now (GSEA on consistent fold changes) is probably a decent route given the data you have at the moment. Having said that, if you tried to publish the results and I were one of the reviewers then the paper would get immediately rejected (unless treating the cell-lines was somehow extremely difficult/expensive...and the bar for that would be relatively high).

Sequencing is pretty cheap these days and cell lines are generally not that hard to come by. Before spending too much time on a questionable analysis, you might as well just redo the experiment in a publishable form and use non-questionable methods to analyze the results. You might end up with the same results either way, but this way you won't have problems later on.
 

ADD REPLYlink written 4.9 years ago by Devon Ryan92k

Thumbs up for this advice. I get cold chills down my spine every time I see someone try to claim something with <3 (no, that's not a heart) biological replicates.

ADD REPLYlink written 4.9 years ago by David Westergaard1.4k

So, I think obviously everybody realizes that 'more biological replicates' is better than 'less biological replicates'. However, could you or Devon or someone else ELI5 (explain like I'm 5 yrs old) why a method like GFOLD is not to be trusted? I've gone through the GFOLD paper (http://bioinformatics.oxfordjournals.org/content/28/21/2782.full.pdf+html) where the authors do quite comprehensive comparisons between the then available methods (deseq, edger, cufflinks, fold change with offset etc) and their results on no-biological-replicates seem pretty solid to be discarded without a more qualified argument. 

Also, I'd like to remind that in our setup we are analyzing the effects of treatment on 4 distinct, but different-enough cell lines, and we're interested in common/consistent effects among the 4. 

ADD REPLYlink written 4.9 years ago by Christos Gekas50

You're interested in measuring differences between two biologically variable groups. It is simply impossible to accurately gauge biological variability. No one will ever conceive of a perfect method to do this. The examples used in the GFOLD paper are..."interesting". It should be noted that when a Poisson model (aka DEGseq) performs better than edgeR that the dataset is very atypical.

So trust GFOLD for as much as it's worth. If you're intent is only to find some candidates that will be undergoing thorough validation then GFOLD is probably OK.
 

ADD REPLYlink written 4.9 years ago by Devon Ryan92k

I disagree.  I would argue that William Stanley Gosset solved this problem about a hundred years ago.

It's a paired t-test.

ADD REPLYlink written 4.9 years ago by Michele Busby2.1k
1

And test what with it? That two samples are different from each other? They will be, but that's not interesting. A paired t-test solves a completely different problem.

ADD REPLYlink written 4.9 years ago by Devon Ryan92k

I should add that I assume the paired T-test would be pairing treated and untreated within cell-line. If the cell lines are indeed actually quite different, then this won't work due to cell-line specific changes due to the treatment (you'd miss all of them). If there are a reasonable number of consistent changes due to treatment then that'd work, but one could also just use a regular GLM (~line+treatment). If the goal in using multiple cell-lines was to find more robust changes then that would work OK, assuming the FDR correction doesn't turn out to be too brutal. Of course if the goal is now to find within cell-line changes (as suggested in the title), then that won't help at all.

ADD REPLYlink written 4.9 years ago by Devon Ryan92k

They are interested in what genes change expression in response to the treatment.  To do that they used four biological replicates (different cell lines) but because the cell lines had so much variability their experiment was under powered to detect differences due to the treatment effect rather than the differences in cell line. That's a textbook example of when you use a paired t test.

Using different cell lines actually blocks against cell-line specific effects and gets at the what the treatment is doing.  Those changes, if the treatment is actually doing something, should be robust.  There isn't a problem with lack of replication. They are blocking on a wide range of biological variables by using different cell lines.  That is how experiments should designed.

ADD REPLYlink written 4.9 years ago by Michele Busby2.1k

I agree with Devon. paired t-test would not be appropriate

ADD REPLYlink written 4.9 years ago by Manvendra Singh2.1k
2
gravatar for Manvendra Singh
4.9 years ago by
Manvendra Singh2.1k
Berlin, Germany
Manvendra Singh2.1k wrote:

any attempt to work without any replicates will lead to conclusions of very limited reliability.

Its written in DESeq package that 

"the mean is a good predictor for the
variance. Hence, if a number of genes with similar expression level are
compared between replicates, then their variation is expected of
comparable magnitude. Once we accept this assumption, we may argue as
follows: Given two samples from different conditions and a number of genes
with comparable expression levels, of which we expect only a minority to
be influenced by the condition, we may take the variance estimated from
comparing their count rates across conditions as ersatz for a proper
estimate of the variance across replicates. After all, we assume most
genes to behave the same within replicates as across conditions, and
hence, the estimated variance should not change too much due to the
influence of the hopefully few differentially expressed genes."

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Manvendra Singh2.1k
1
gravatar for Michele Busby
4.9 years ago by
Michele Busby2.1k
United States
Michele Busby2.1k wrote:

I would try normalizing all the data by some reasonable factor (e.g. start with a comparison of the median read count in each sample) and then see if a paired t-test gives you a sensible result.

This is a different situation than having no replicates.

http://www.biostathandbook.com/pairedttest.html

This is a very standard approach.

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Michele Busby2.1k

Thanks for the answer Michele, I'll try it. 

in your opinion, which metric should I be comparing? normalized read counts, TPM, FPKM ? 

 

Thanks!

ADD REPLYlink written 4.9 years ago by Christos Gekas50

For the statistic it matters most how you normalize your reads by the overall read depth.  If you are calculating FPKM or RPKM you are de facto normalizing by the total read count.

The way DESeq and some other packages work is that they normalize by a scaling factor based on the median read count, which will, in outlier cases, be more robust than normalizing by total read count.

But using FPKM or RPKM is really nice because it is the output of packages like RSEM which can do some acrobatics and call transcript abundances.  And you may not have an outlier case.

So I would start with the FPKM output by RSEM or something.  Then from there it's fairly easy to see if the normalization by total reads is appropriate.  Just take the FPKM value for each comparison and plot them with one on the X axis and its partner on the Y axis.  You should be able to draw a line of best fit right through the middle and it should go with a slope of 1.

So do that for each pair and if it looks reasonable I'd say you are OK with FPKM.  

If it doesn't work (for example if your data has a huge outlier in one sample) you'll have to do something more complex. But not much more complex.  A scaling factor based on the median FPKM usually works OK.

So in a recent paper I did I tried both approaches and FPKM worked great for something like 96 out of the 98 samples.  It's not like microarrays were.

For differential expression it doesn't matter if you normalize by length or not because you are just dividing both sides by the same number.

Don't forget to calculate your FDR or do a multiple testing correction.

ADD REPLYlink written 4.9 years ago by Michele Busby2.1k
1
gravatar for Antonio R. Franco
2.9 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.1k wrote:

Take a look to the NOISeq R package It is designed to compare two RNA-Seq experimental designs with no parametric assumptions, and no replicates are required

ADD COMMENTlink written 2.9 years ago by Antonio R. Franco4.1k
1
gravatar for sysbiocoder
2.9 years ago by
sysbiocoder170
sysbiocoder170 wrote:

Can give audic method a try ususally it was used for microarray based analysis, but is applicable for RNAseq also Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res 1997; 7:986–995.

I have deposited R-script for audic method in github. https://github.com/sysbiocoder/Audicmethod

ADD COMMENTlink written 2.9 years ago by sysbiocoder170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2106 users visited in the last hour