Hey Hamada,
Thanks for your question. Batch effects can indeed be a major concern in large datasets like TCGA-HNSC, where samples come from diverse tissue sites (e.g., oral cavity, larynx, pharynx), and this heterogeneity may reflect both technical artifacts and genuine biological differences. Distinguishing between the two is key, as blindly correcting for all unwanted variation risks removing meaningful signals related to tumor subtypes or anatomical origins.
Surrogate Variable Analysis (SVA) via the sva Bioconductor package is a solid choice for capturing latent batch effects in RNA-seq data, particularly when explicit batch annotations are missing or incomplete (for TCGA, plate and collection-center information can actually be recovered from the sample barcodes and Biospecimen metadata, so it is worth checking those first). SVA estimates surrogate variables (SVs) that represent unmodeled sources of variation, which you can then include as covariates in your DESeq2 design formula. This approach has been widely used in TCGA analyses and is covered in the DESeq2 workflow material on removing hidden batch effects.
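To make that concrete, here is a minimal sketch of estimating SVs and folding them into the design. It assumes you already have a DESeqDataSet `dds` with a hypothetical grouping variable `condition` in its colData; the variable names are illustrative, not from your data.

```r
library(DESeq2)
library(sva)

dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)
# Drop genes with near-zero counts before SVA
norm_counts <- norm_counts[rowMeans(norm_counts) > 1, ]

mod  <- model.matrix(~ condition, colData(dds))  # full model: biology of interest
mod0 <- model.matrix(~ 1, colData(dds))          # null model: intercept only

svobj <- svaseq(norm_counts, mod, mod0)          # n.sv is estimated internally

# Add each surrogate variable as a covariate, biology last in the formula
for (i in seq_len(svobj$n.sv)) {
  colData(dds)[[paste0("SV", i)]] <- svobj$sv[, i]
}
design(dds) <- as.formula(
  paste("~", paste0("SV", seq_len(svobj$n.sv), collapse = " + "), "+ condition")
)
dds <- DESeq(dds)
```

Putting `condition` last keeps the convention that `results(dds)` defaults to the variable of interest.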
However, as the previous commenter noted, deciding on the number of SVs to include can feel somewhat subjective. A best practice is to use an iterative approach guided by diagnostics:
- Start by running SVA on your normalized counts: use svaseq() for RNA-seq count data (plain sva() expects roughly Gaussian, e.g. log-transformed, input). If you don't supply n.sv, the number of SVs is estimated internally via num.sv(), but you can and should refine this.
- Visualize the data before and after adjustment using principal component analysis (PCA). I recommend the PCAtools Bioconductor package for this, as it provides tools to explore loadings, identify drivers of variation, and decide how many PCs to retain. For example, run pca() on your variance-stabilized data (from DESeq2's vst()), then draw biplots and scree plots to assess whether clusters align with known factors like tissue site or instead suggest technical batches.
- Check the association of estimated SVs with your variables of interest (e.g., via linear models or correlation tests) to ensure you're not over-correcting biological signals. If an SV strongly correlates with tissue origin, consider excluding it or modeling tissue explicitly in your design.
- Iterate: Adjust your model with varying numbers of SVs (e.g., top 2-5), re-run PCA, and evaluate if the correction reduces unwanted clustering without flattening biological differences.
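The diagnostic steps above can be sketched roughly as follows. This assumes a DESeqDataSet `dds`, an SVA result `svobj`, and hypothetical colData columns `tissue_site` and `plate`; all names are illustrative.

```r
library(DESeq2)
library(PCAtools)

vsd <- vst(dds, blind = FALSE)
p <- pca(assay(vsd),
         metadata = as.data.frame(colData(vsd)),
         removeVar = 0.1)              # drop the 10% least-variable genes

screeplot(p)                           # how many PCs carry real signal?
biplot(p, colby = "tissue_site")       # do clusters track biology...
biplot(p, colby = "plate")             # ...or a technical batch?

# Check whether each SV is confounded with the biology of interest
for (i in seq_len(svobj$n.sv)) {
  fit <- summary(lm(svobj$sv[, i] ~ colData(dds)$tissue_site))
  cat("SV", i, "R^2 vs tissue_site:", round(fit$r.squared, 2), "\n")
}
```

An SV with a high R-squared against tissue_site is a warning sign that including it would absorb biology rather than technical noise.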
If SVA feels too black-box or if you have known batch factors (check TCGA's Biospecimen data for plate/barcode info), alternatives could be more suitable:
- ComBat (from the sva package) or ComBat-seq (which models counts directly with a negative binomial) if you have explicit batch labels. These adjust for known batch effects while preserving biological variation, provided you pass your group of interest so it isn't removed.
- RUVSeq (from Bioconductor), as mentioned earlier, which uses negative control genes or residuals to estimate factors of unwanted variation. Mike Love's vignette on RUV (linked in the previous answer) is excellent for this, and it's integrated well with DESeq2.
- For TCGA specifically, some studies use limma's removeBatchEffect() after voom transformation when working in a limma pipeline (note that removeBatchEffect() is intended for visualization and clustering; for DE testing, batch should go into the design instead). Since you're using DESeq2, stick to count-based methods to avoid data transformation issues.
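If you do have explicit batch labels, a minimal ComBat-seq sketch looks like this. It assumes a DESeqDataSet `dds` with hypothetical colData columns `plate` (the batch) and `condition` (the biology); the names are illustrative.

```r
library(sva)

raw_counts <- counts(dds)            # ComBat-seq wants raw, not normalized, counts
batch <- colData(dds)$plate          # known batch labels, e.g. parsed from barcodes

adjusted <- ComBat_seq(raw_counts,
                       batch = batch,
                       group = colData(dds)$condition)  # protect the biology
```

The adjusted matrix is still integer counts, so it can go back into DESeq2; that said, when batch labels are available, modelling them in the design (~ plate + condition) is often preferable to pre-adjusting the counts.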
In any case, always validate your correction by checking downstream results—e.g., do DE genes make biological sense in HNSC pathways (like TP53 signaling or HPV status)? Also make sure you're on the current Bioconductor release so that sva, DESeq2, and PCAtools are up to date.
If you share more details about your design (e.g., comparison groups or specific tissue subtypes), I can offer more tailored advice. Feel free to post example code or PCA plots if you're stuck.
Kevin
SVA is definitely an option. For inspiration (using RUVSeq, though for the end user it's similar) you might want to look at this exploratory analysis from the DESeq2 developer Mike Love, where he addresses tackling unwanted variation in larger-cohort sequencing data. From experience, the problem with tools like SVA and RUV is that it is (at least to me) largely arbitrary how many of the surrogate variables or factors you eventually include in your design. Maybe someone has a best practice here, but in the end, to me, it's always "try, apply, stare at the PCA plot, repeat" until your gut tells you it's now fine to proceed with DE analysis.