I have a few questions on the topic of batch correction.
The pipeline for this data is currently: TopHat2 (alignment to genome reference) > htseq-count (count reads per gene) > DESeq2 in R
I would really appreciate any thoughts you can offer, or tips on how your lab does things!
My questions are:
- How can you perform batch correction with sequencing day as a variable when not all samples were re-sequenced on both days?
Backstory: We had two batches of sequencing (Day 1 and Day 2). A few, but not all, of the libraries sequenced on Day 1 were re-sequenced on Day 2. Any suggestions for how I should correct for the Day 1/Day 2 batch effect in this data? Is including "SequencingDay" in the design formula of my DESeq2 object sufficient, or should I rely on SVA, or something else?
- Related to the above: how can you perform batch effect correction when you have multiple batch effects and not all possible combinations of variables are represented in the samples?
Backstory: We have some batch effects that unfortunately aren't distributed evenly across all samples. For example, of 10 samples, say we have suspected effects of Genotype, Sex, and Treatment, but no (Genotype1 + Male + Treatment2) sample. Is the solution basically 'pick your favorite batch effect' and only correct for that? In the example above, that would mean making the design matrix Genotype + Treatment and ignoring the effect of Sex. Is there a better way?
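To make the missing-combination problem concrete, here is a minimal sketch (in Python rather than R, purely for illustration) that checks whether a design matrix is full rank for a sample table with no (Genotype1 + Male + Treatment2) sample. The 10 samples, the level names G1/G2, F/M, T1/T2, and the helper functions are all hypothetical and not taken from our real data. It compares an additive design (Genotype + Sex + Treatment) against one with every interaction term.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination on Fractions."""
    m = [[Fraction(v) for v in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                f = m[r][col] / m[rank][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical sample table: 10 samples, no (G1, M, T2) combination.
samples = [
    ("G1", "F", "T1"), ("G1", "F", "T2"), ("G1", "M", "T1"),
    ("G2", "F", "T1"), ("G2", "F", "T2"), ("G2", "M", "T1"),
    ("G2", "M", "T2"), ("G1", "F", "T1"), ("G2", "M", "T2"),
    ("G1", "M", "T1"),
]

def additive_row(g, s, t):
    # Intercept plus one dummy column per factor (main effects only).
    return [1, int(g == "G2"), int(s == "M"), int(t == "T2")]

def full_row(g, s, t):
    # Main effects plus all two-way and three-way interaction columns.
    a, b, c = int(g == "G2"), int(s == "M"), int(t == "T2")
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]

additive = [additive_row(*s) for s in samples]
full = [full_row(*s) for s in samples]

additive_rank = matrix_rank(additive)
full_rank = matrix_rank(full)

print("additive:", additive_rank, "of", len(additive[0]), "columns")
print("full interaction:", full_rank, "of", len(full[0]), "columns")
```

In this toy table the additive design keeps all of its columns linearly independent, while the full-interaction design does not, because one cell is empty. Whether that carries over to our actual sample layout is exactly what I am unsure about.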
- How can I integrate spike-ins into my analysis pipeline?
In the alignment step, how do I make sure spike-ins are represented in the output file (i.e. gene counts), if there isn't a special version of the genome reference file? Does merely including spike-ins in our input data boost the accuracy of DESeq2's normalization algorithm enough? Or is there some other layer of normalization I should do when using spike-ins?
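To show what I mean by "some other layer of normalization", here is a rough sketch of median-of-ratios size factors computed from spike-in rows only, instead of from all genes. As far as I understand, this is the idea behind passing `controlGenes` to DESeq2's `estimateSizeFactors`, but the Python below is only my own illustration. The counts, the `SPIKE-` identifiers, and the `size_factors` helper are made up.

```python
import math
from statistics import median

def size_factors(counts, control_rows=None):
    """Median-of-ratios size factors.

    counts: dict mapping feature id -> list of raw counts, one per sample.
    control_rows: optional list of feature ids to restrict the estimate to
                  (e.g. spike-ins); defaults to all features.
    """
    features = list(counts) if control_rows is None else list(control_rows)
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for f in features:
        row = counts[f]
        if any(c <= 0 for c in row):
            continue  # features with a zero count have no finite log geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no usable features for size factor estimation")
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical counts for two samples: four endogenous genes and two spike-ins.
counts = {
    "geneA": [1000, 1000],
    "geneB": [500, 500],
    "geneC": [200, 200],
    "geneD": [50, 50],
    "SPIKE-1": [100, 200],
    "SPIKE-2": [50, 100],
}
spike_ids = [f for f in counts if f.startswith("SPIKE-")]

all_gene_sf = size_factors(counts)
spike_sf = size_factors(counts, control_rows=spike_ids)

print("all features:", all_gene_sf)
print("spike-ins only:", spike_sf)
```

In this toy example the two estimates disagree: the endogenous genes dominate the median when everything is included, so simply having spike-ins in the count table does not change much on its own.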