Question

RNA-Seq comparing transcript levels between samples and between experimental runs. How to design an experiment to be able to do this?

9.4 years ago

ncl.wong • 0

Hi there, the closest post related to my questions can be found here.

Long time reader, first time poster.

I have been looking into RNA-Seq, specifically at data where I would like to identify differential transcript abundance between two samples that may or may not have been prepared in the same experimental run. For example, public RNA-Seq data compared with our own in-house RNA-Seq data.

Can they be compared? The more I look into this, the more I find that the answer is no.

So the next question is: can one design an experiment to make this possible? What I am talking about here is a universal reference, such as ERCC spike-ins or a universal reference sample that is sequenced in every experimental run.

I would love to hear tips and insights on both of these queries.

Cheers

sequencing normalisation RNA-Seq cross-sample • 3.5k views

updated 2.2 years ago by Ram 43k • written 9.4 years ago by ncl.wong • 0

For RNA-Seq, my two cents is that you should have no problem comparing them if you are sure that the two samples are from similar time points and conditions and that they were extracted in similar ways. If you have the raw data (e.g. reads) instead of just the RPKM values, then it should be fine?

ERCC is good, but I currently feel that it is better suited to single-cell RNA-Seq when it comes to normalization. Otherwise, it seems to work better as a quality control. Other people might be able to give you clearer insight into this question...

updated 2.2 years ago by Ram 43k • written 9.4 years ago by Sam ★ 4.7k

Ram · Answer 1 · 2015-01-08

Eek, you're pretty much guaranteed to run into a notable batch effect in that situation. While I agree completely with Sam that normalizing to ERCC spike-ins should normally only be done in the context of single-cell sequencing, this case may present an exception to that. I suspect that a combination of spike-ins and RUVSeq (in essence, a variant on the SVA concept that uses control genes/spike-ins) would be the closest to what you're looking for.
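
For illustration only, here is a minimal Python sketch of the idea behind that suggestion: estimate factors of unwanted variation from the spike-ins alone, then regress them out of every gene. This is not the RUVSeq package (which is R/Bioconductor) or its exact algorithm; the function name `remove_unwanted_variation`, the samples-by-genes layout, and the choice of `k` are assumptions made for the example.

```python
import numpy as np

def remove_unwanted_variation(log_counts, control_idx, k=1):
    """Hypothetical helper, not part of any library.

    log_counts  : samples x genes matrix of log-scale expression values
    control_idx : column indices of spike-ins / control genes that are
                  assumed to be unaffected by the biology of interest
    k           : number of unwanted-variation factors to estimate
    Returns (corrected matrix, estimated factors W).
    """
    Y = np.asarray(log_counts, dtype=float)
    gene_means = Y.mean(axis=0)
    Yc = Y - gene_means                      # centre each gene

    # Factors of unwanted variation: top-k left singular vectors
    # (scaled by singular values) of the control genes only.
    U, s, _ = np.linalg.svd(Yc[:, control_idx], full_matrices=False)
    W = U[:, :k] * s[:k]

    # Regress every gene on W and subtract the fitted part.
    alpha, *_ = np.linalg.lstsq(W, Yc, rcond=None)
    corrected = Yc - W @ alpha + gene_means
    return corrected, W
```

A sketch like this can only help if the batch effect actually shows up in the controls, and it cannot separate batch from biology when the two are completely confounded (e.g. all public samples are one condition and all in-house samples the other).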

Ram · Answer 2 · 2015-01-08

9.4 years ago

ncl.wong • 0

Thank you for your responses, Sam and Devon. It is interesting that you comment on the utility of ERCC for single cells, and I can understand where you are coming from.

I have played with some publicly available tissue RNA-Seq data and was not getting any good correlation between experiments when plotting the FPKM of each tissue from different projects, so there is surely a big batch effect there.

Interested to see how this conversation goes.

updated 2.2 years ago by Ram 43k • written 9.4 years ago by ncl.wong • 0

I found similar things when I looked at this.

You will have both technical and biological batch effects between the runs.

First, you have the experimental protocol: different protocols yield different quantifications of the same data. See http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html for comparisons of low-input and degraded samples; some of Joshua Levin's other papers have similar comparisons. Some protocols, for example, favor high-GC over low-GC transcripts. Some pull only poly-adenylated transcripts, while others pull more unprocessed transcripts. Complexity differs by protocol. So if you compare one dataset to another, they need to use the same protocol, and even the same protocol can perform differently in different hands.

Then you have biological batch effects, which are an even bigger problem. It is difficult to find two cell line samples grown in exactly the same way (i.e. at the same temperature, at the same cell density, fed the same), even within the same lab on different days, and "real" samples are even worse.

So it is difficult to block experimental effects from the biological effect you are interested in.

Probably not totally impossible, but computationally complex enough that it might not cost less than running samples for what you are really interested in.

A quick and dirty comparison might be good enough for grant pilot data though.

ERCCs will help with the technical problem but they are shorter and have some other subtle differences from natural human transcripts, so they may not give you enough signal to model all of the technical artifact that is present in the sample. For example, if there is a substantial length bias between two data sets it may not show up in the ERCCs so you may overcall long transcripts as differentially expressed between the sets.

I can't think of how they would help with the biological problem.
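
As a rough way to check for the length-bias problem mentioned above, one could regress the per-gene log ratio between the two datasets on log transcript length; a slope far from zero would suggest a length-dependent technical difference. This is only a sketch: the function name `length_bias_slope`, its inputs, and the pseudocount default are assumptions for illustration, not something from a particular package.

```python
import numpy as np

def length_bias_slope(expr_a, expr_b, lengths, pseudocount=1.0):
    """Hypothetical helper: slope of log2(B/A) against log2(length).

    expr_a, expr_b : per-gene expression in dataset A and B (same gene order)
    lengths        : per-gene transcript lengths in bases
    pseudocount    : added before taking logs to avoid log(0)
    """
    a = np.asarray(expr_a, dtype=float) + pseudocount
    b = np.asarray(expr_b, dtype=float) + pseudocount
    log_ratio = np.log2(b) - np.log2(a)
    log_len = np.log2(np.asarray(lengths, dtype=float))
    # Least-squares line; polyfit returns [slope, intercept] for degree 1.
    slope, _intercept = np.polyfit(log_len, log_ratio, 1)
    return slope
```

A slope near zero does not prove the datasets are comparable; it only says this one kind of bias is not obvious.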

updated 2.2 years ago by Ram 43k • written 9.4 years ago by Michele Busby ★ 2.2k