I have two batches of RNAseq data, one containing all of my cases and one containing all of my controls. Assume that I cannot alter the status of these batches to scatter my cases and controls among both batches, which would obviously be better.
What if I took RNA from a cell line (neither case nor control), split it into two aliquots, and added it into each batch? Are there methods that will allow me to model the batch effects using this technical replicate, then apply correction to the rest of the samples?
Seems like RUVseq tried to do something similar using ERCC probes (with only moderate success), but that's slightly different, since it's explicitly defining a set of ERCC "genes" to use for modeling the batch effects.
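(For context, the core of the RUVg approach is a factor analysis on the control genes: take the spike-in counts, extract the top singular vectors as factors of unwanted variation, and carry those into the model for everything else. A minimal Python sketch of that idea, not RUVseq's actual implementation; the genes-by-samples layout and the number of factors are assumptions:)

```python
import numpy as np

def ruv_factors(log_counts, control_idx, n_factors=2):
    """RUVg-style idea: estimate factors of unwanted variation by SVD
    on the (centered) control genes, e.g. ERCC spike-ins."""
    ctrl = log_counts[control_idx, :]                # controls x samples
    ctrl = ctrl - ctrl.mean(axis=1, keepdims=True)   # center each control gene
    _, _, vt = np.linalg.svd(ctrl, full_matrices=False)
    return vt[:n_factors, :].T                       # samples x n_factors

def regress_out(log_counts, W):
    """Remove the unwanted factors from every gene (for plots/QC; for
    differential testing you'd include W as covariates instead)."""
    X = np.column_stack([np.ones(W.shape[0]), W])    # intercept + factors
    beta, *_ = np.linalg.lstsq(X, log_counts.T, rcond=None)
    return log_counts - (X[:, 1:] @ beta[1:]).T
```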
Biologically, I don't see how what you're describing could model more than a (small) portion of the batch effect. If there are differences in sample processing between batches prior to the RNA isolation step (time/method of sample collection, method of sample processing, etc.), what you describe wouldn't be of any use. It seems like it would only be useful if the only batch effects existed downstream of RNA isolation (differences in PCR amplification(?), sequencing lane, or any other differences in the sequencing workflow itself), but this seems unlikely to be your situation.
The most realistic/likely counterexample I can think of is if the main batch effect were duration of RNA storage prior to sequencing, but even then...
Agreed. Maybe you'd get some benefit if you processed the cell line along with each set of samples (assuming they're being processed at the same time) in the same manner. It'd at least let you check whether the sources of variation identified in the cell line replicates also separate your cases from your controls. Adding it just prior to library prep/seq is likely of limited usefulness.
In my experience, ERCC spike-ins are most useful when global changes in transcription are expected; otherwise, trying to normalize to them (via RUVseq) has muddied the waters further.
Well, what you're getting at here is the question of what we even call a batch. Whether or not these cell line samples would be good for estimating the batch effect would depend on what the actual samples shared in common to be called a batch in the first place.
Yeah, that's absolutely a key part of the question. These are patient samples that have been collected, frozen, etc, in comparable ways. That part of the "batch" is completely out of our hands, though any variability really shouldn't associate preferentially with condition A or B.
For logistical reasons, thawing, RNA extraction, and library prep have to happen several weeks apart (though by the same person, with the same brand of kits, etc.). The effects we're concerned with would probably be: different batches of reagents in the extraction/library construction, and running on different flowcells on different days.
My thinking is that both of these could potentially be modeled by taking aliquots of the same cell line, splitting them, and freezing them. Then, each time a batch of patient samples is run, an aliquot could undergo the same extraction/library prep at the same time and be pooled with the patient samples for sequencing.
If nothing else, it should let us know if there are major batch effects from those sources, but I don't really know if they're going to provide enough info to model and correct for it (hence the question!) :-)
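The naive version of that correction would be a per-gene shift estimated from the aliquots, something like the sketch below (made-up names; it assumes the batch effect is purely additive on the log scale, which is exactly the assumption being questioned above):

```python
import numpy as np

# All matrices are genes x samples on the log scale (assumed layout).
# cellline_b1 / cellline_b2: cell line aliquot libraries from each batch;
# patients_b2: batch 2 patient samples to shift onto batch 1's scale.
def correct_with_replicates(patients_b2, cellline_b1, cellline_b2):
    """Per-gene additive correction estimated from technical replicates."""
    # Batch offset per gene = how much the *same* RNA moved between batches.
    offset = cellline_b2.mean(axis=1) - cellline_b1.mean(axis=1)
    return patients_b2 - offset[:, None]
```

With a single aliquot per batch, that offset comes from one observation per gene, so it would be very noisy; several replicate libraries per batch, or shrinking the per-gene offsets toward zero, would presumably help.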
If that's the case, feels worth a shot to me. ¯\_(ツ)_/¯
I agree with Jared: of course you have thought this through and have a proper use case for what you described! Sadly, though, I don't have any ideas for how to model the batch effects computationally.
No idea if it would work or not, but if I were going to do it, I'd just include them in the design matrix: three conditions (A, B, or cell line) and two batches (1 or 2).
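Something like this, e.g. with patsy in Python (the sample sheet here is invented for illustration; any GLM framework that takes a design matrix works the same way):

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical sample sheet: three condition levels (A, B, CellLine),
# two batches, with the cell line appearing once in each batch.
samples = pd.DataFrame({
    "condition": ["A", "A", "CellLine", "B", "B", "CellLine"],
    "batch":     ["b1", "b1", "b1", "b2", "b2", "b2"],
})

# Additive model: condition effects estimated while adjusting for batch.
design = dmatrix("~ condition + batch", samples)
print(design)
```

Worth noting: since A sits entirely in batch 1 and B entirely in batch 2, the cell line rows are the only thing making condition and batch separable in that model, and the A-vs-B contrast still leans on the assumption that the batch effect seen in the cell line applies to the patient samples.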
Agreed. And just pop 'em in a PCA and see if there are any actual batch effects.
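e.g. something quick like this (the random matrix is a placeholder for a real genes x samples log-CPM table):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
log_cpm = rng.normal(size=(5000, 8))    # placeholder: genes x samples log-CPM

coords = PCA(n_components=2).fit_transform(log_cpm.T)  # rows = samples
print(coords)
# Color the points by batch: if PC1/PC2 split cleanly by batch (and the two
# cell line aliquots land far apart), there's a real batch effect to model.
```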
I'm sure we'd all welcome an update in a few weeks if you try this strategy out, Chris.