Hi everyone,

I'd really appreciate some input on this:

I'm working with alternative splicing event expression data in AML (very heterogeneous). Unlike gene expression, splicing events are quantified using percent spliced in (PSI) which represents the relative inclusion of an exon to the total gene expression. For example, a skipped exon PSI of 0.6 essentially means that 60% of the gene expression comes from transcripts that include the exon and the remaining 40% originate from transcripts that skip the exon.

Unfortunately, sources are limited for discussing how to handle batch effects and confounding factors in splicing PSI data. The current PSI data must be used as is for various (complicated) reasons, so although a few differential splicing methods (differential expression at the splicing level) allow known batches to be included when testing differential expression at the splice event level, those methods 1) need the raw read alignments as input, 2) don't handle unknown confounding effects, and 3) do not address confounders in PSI data for unsupervised learning (i.e. clustering).

I need to perform both unsupervised learning and differential testing (differential expression) using this PSI dataset, but due to the size (N = 450) and complexity of this disease cohort, there are most certainly unwanted sources of variation that need to be accounted for.

The approach I'm considering is to use the SVA package in R to estimate hidden confounding effects and adjust the PSI data using the removebatcheffect() function from the limma package (unlike ComBat it can adjust for continuous effects).

For unsupervised learning (clustering, network analysis, etc) my current approach is this:

```
# Set up model matrix
mod1 = model.matrix(~SubType, ClinicalSummary) # AML sub-type (I want to keep this biological signal)
mod0 = model.matrix(~1, data=ClincalSummary) # Null model matrix
# Estimate and remove surrogate variables
n.pc <- num.sv(t(PSImat), mod1, method="leek") # Returns 2 SVs present
svs <- sva(t(PSImat), mod1, mod0, n.sv=n.pc)$sv
psiMat.adjusted <- removeBatchEffect(t(PSImat), covariates = svs, design=mod1)
```

The above seems to work well, but I'd really appreciate some input from people who are familiar with the methods I'm using.

For differential splicing between two phenotype groups (I have a variety of these, btw) my current approach is this:

```
# Set up model matrix
mod1 = model.matrix(~Response, ClinicalSummary) # Ex) sensitive vs resistant drug response
mod0 = model.matrix(~1, data=ClincalSummary) # Null model matrix
# Estimate and remove surrogate variables in context of outcome of interest
n.pc <- num.sv(t(PSImat), mod1, method="leek")
svs <- sva(t(PSImat), mod1, mod0, n.sv=n.pc)$sv
psiMat.adjusted <- removeBatchEffect(t(PSImat), covariates = svs, design=mod1)
```

Then for each splicing event perform Wilcoxon rank-sum test between the two groups. Select significant events if they (for example) have a change of PSI > 0.1 and FDR < 0.01 after correcting for multiple testing.

A caveat for both sections above is that I'm doing a logit transformation of the PSI values before SVA/removebatcheffect and then an inverse logit transformation after adjustment. This ensures the PSI values remain between 0 and 1 (which does work). I left this out of the pseudo-code for simplicity.

I'd really appreciate some input on this. Again, it's not ideal, especially for the differential testing approach, but keep in mind I can't just use something like edgeR with splicing PSI (at least I don't think so) and these PSI values are pre-quantified so I have to use the data as is.