Hello,
we're working on a differential expression (DEG) analysis using DESeq2 for a dataset involving a eukaryotic host experimentally infected with a virus.
Dataset: Our design includes comparisons across different infection treatments, between infected and non-infected controls, and across four time points. Combining these three variables gives 16 distinct "conditions", and we have 4-5 biological replicates per condition after QC.
Read quantification: We quantified transcript expression using Salmon, with a combined index that includes both host and viral transcriptomes. We used tximport to map the transcripts to host and viral genes. Across all samples, about 99% of reads map to the host.
Our goal is to analyze differential expression in both host and viral genes. This leads to a key question:
Should we perform DEG analysis on the combined host + virus transcriptome in DESeq2, or analyze host and viral genes separately?
A different post here suggested that a joint analysis is often the best choice unless "the inter-sample variability (e.g. the spread of points as you could see in a PCA plot -- see vignette) is vastly different across subsets." If I understand this point correctly, the variability does differ strongly between subsets here. A PCA of VST-transformed counts separates time point 3 from everything else along PC1 (~35% of variance explained), both in the joint analysis and in the host-only analysis (the two look very similar). The virus-only PCA looks very different: PC1 explains ~84% of the variance and separates time point 1. This makes biological sense to us, but we're not sure whether it qualifies as one of the situations that warrant a separate DE analysis.
We've noticed substantial differences in results depending on the approach. For example, in one particular contrast between infected treatments, we identify 172 significant viral DEGs when analyzing the full dataset (all of them changing in the same direction relative to the control), but only a handful when restricting the analysis to viral genes alone.
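One possible explanation, which we have not verified on our own data, is that median-of-ratios normalization behaves very differently on the two gene sets. The sketch below is plain Python with made-up counts, not DESeq2 itself; `size_factors` and `fold_change` are hypothetical helper names, and the toy matrix (5 stable host genes, 3 viral genes that are all 4x higher in samples 3 and 4) is an assumption chosen only to illustrate the effect.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix.

    counts: list of genes, each a list of per-sample counts.
    Only genes with all-positive counts are used, so that the
    log geometric mean of each gene is finite.
    """
    n_samples = len(counts[0])
    usable = [g for g in counts if all(c > 0 for c in g)]
    log_geo_means = [sum(math.log(c) for c in g) / n_samples for g in usable]
    return [
        math.exp(median(math.log(g[j]) - lg for g, lg in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]

def fold_change(gene, sf):
    """Ratio of normalized counts: samples 3-4 versus samples 1-2."""
    norm = [c / s for c, s in zip(gene, sf)]
    return (norm[2] + norm[3]) / (norm[0] + norm[1])

# Toy data (assumed): host genes do not change between samples.
host = [[100, 100, 100, 100],
        [200, 200, 200, 200],
        [50, 50, 50, 50],
        [400, 400, 400, 400],
        [80, 80, 80, 80]]
# Toy data (assumed): every viral gene is 4x higher in samples 3 and 4.
virus = [[10, 10, 40, 40],
         [20, 20, 80, 80],
         [5, 5, 20, 20]]

sf_joint = size_factors(host + virus)  # driven by the stable host majority
sf_virus = size_factors(virus)         # driven by the shifting viral genes

fc_joint = fold_change(virus[0], sf_joint)
fc_virus = fold_change(virus[0], sf_virus)

print("joint size factors:     ", sf_joint)
print("viral-only size factors:", sf_virus)
print("viral gene fold change (joint normalization):     ", fc_joint)
print("viral gene fold change (viral-only normalization):", fc_virus)
```

In this toy case the joint size factors are about 1 for every sample, so the 4x viral shift is preserved. The viral-only size factors are about 0.5 and 2, which absorbs the shift into the normalization and leaves a fold change of about 1. If something similar happens in our data, it could explain why a coordinated, same-direction change in viral genes largely disappears in the viral-only analysis.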
Subsetting to viral genes results in much smaller library sizes, and we're also considering that host and viral genes may be influenced by different biological processes or technical factors. Given these considerations, what would be the most appropriate strategy for this kind of analysis?
Thanks in advance for your insights!
Cross-posted from Bioconductor.