I am going to be doing RNA-seq in batches. The situation is this: each batch will contain different samples, will be run on the same machine with the same library kit (but not the same prep), and the batches will be separated by a few months.
Do people ever use a reference sample in these instances to normalize for batch effects? For example, one could create a large set of aliquots of RNA from the exact same sample (or pool of samples) and monitor how its gene values change across batches. (Let's assume for the sake of argument that degradation is not an issue and that all samples are spread across all lanes of the machine. The timescale I am thinking of is approximately 1 year, so maybe it is not appropriate to make this assumption?)
The paper below used control samples sequenced on two machines to control for platform: "Multi-platform analysis of 12 cancer types reveals molecular classification within and across tissues-of-origin". I haven't found many other papers that do this. From the supplement:
"We used a set of 19 colon samples that were sequenced on both platforms to estimate platform differences. A limitation of this approach is that the platform correction was restricted to the 16,116 (out of the 20,531 total) genes expressed in colon, defined as those with 3 or more reads. Upper quartile normalized RSEM data was log2 transformed. Genes with a value of zero were set to the missing value after log2 transformation and genes were filtered if they had missing data in greater than 30% of samples. For the 19 colon samples sequenced on each platform, within each dataset the gene median were calculated. The difference between the GAII platform and the HiSeq platform was calculated and subtracted from the full set of GAII data. The corrected GAII set was merged with the HiSeq data set followed by gene median centering."
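To make the question concrete, here is a rough sketch of how I read the procedure in that supplement, written in Python with NumPy. It is my interpretation, not the authors' code. The function and argument names are my own, I assume the inputs are already upper-quartile normalized genes-by-samples matrices, and I assume the 30% missingness filter is applied across both datasets combined (the quote does not say).

```python
import numpy as np

def log2_with_missing(x):
    """log2 transform; zeros (and negatives) become NaN, i.e. missing."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    np.log2(x, out=out, where=x > 0)
    return out

def reference_offset_correct(old, new, old_ref_cols, new_ref_cols,
                             max_missing=0.30):
    """Shift `old` onto the scale of `new` using shared reference samples.

    old, new      : genes x samples arrays of normalized expression
                    (same genes, same row order) -- assumed inputs
    *_ref_cols    : column indices of the reference samples in each array
    max_missing   : genes missing in more than this fraction of all
                    samples are dropped (assumed to be pooled over both sets)
    Returns (merged genes x samples matrix, boolean mask of kept genes).
    """
    log_old = log2_with_missing(old)
    log_new = log2_with_missing(new)

    # Drop genes with too much missing data across the combined data.
    frac_missing = np.isnan(np.hstack([log_old, log_new])).mean(axis=1)
    # A gene can only be corrected if the reference detects it in both sets.
    has_ref = (np.isfinite(log_old[:, old_ref_cols]).any(axis=1)
               & np.isfinite(log_new[:, new_ref_cols]).any(axis=1))
    keep = (frac_missing <= max_missing) & has_ref
    log_old, log_new = log_old[keep], log_new[keep]

    # Per-gene median over the reference samples within each dataset.
    med_old = np.nanmedian(log_old[:, old_ref_cols], axis=1)
    med_new = np.nanmedian(log_new[:, new_ref_cols], axis=1)

    # Subtract the per-gene difference from the full "old" dataset.
    corrected_old = log_old - (med_old - med_new)[:, None]

    # Merge, then centre each gene on its median.
    merged = np.hstack([corrected_old, log_new])
    merged = merged - np.nanmedian(merged, axis=1, keepdims=True)
    return merged, keep
```

In my setting, `old` and `new` would be two batches rather than two platforms, and the reference columns would be the repeated aliquots. Note that this only estimates a per-gene additive shift on the log scale, and only for genes the reference sample actually expresses.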
Is this strategy a good or bad idea compared with other techniques for controlling batch effects? For example, spike-ins, which are mostly used just for QC and library normalization, or techniques like ComBat, which require good representation of your populations in each batch so that batch and the biology of interest are not confounded.
Any insight is useful.
edit: I will be sequencing clinical samples.