Hi,

I need some input on pathway enrichment using the Kolmogorov–Smirnov (KS) test.

I have a gene expression dataset of controls versus cases, together with each sample's age, gender, batch, strain, etc. I put all of this into a linear model (or a limma model); the formula of the model is below:

```
gene expression ~ age + gender + disease state (case/control) + batch + strain
```
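A minimal sketch of the per-gene fit, assuming simulated data and base R `lm` rather than limma so that it runs without extra packages (limma moderates the gene-wise variances, so its p-values would differ somewhat). All object and column names here (`pheno`, `expr`, `disease`, ...) are placeholders, not names from the real dataset.

```r
set.seed(1)
n_samples <- 40
n_genes <- 200

# Hypothetical sample annotation; column names are placeholders.
pheno <- data.frame(
  age     = rnorm(n_samples, mean = 60, sd = 10),
  gender  = factor(sample(c("F", "M"), n_samples, replace = TRUE)),
  disease = factor(rep(c("control", "case"), each = n_samples / 2),
                   levels = c("control", "case")),
  batch   = factor(rep(c("b1", "b2"), times = n_samples / 2)),
  strain  = factor(sample(c("s1", "s2"), n_samples, replace = TRUE))
)

# Simulated expression matrix: genes in rows, samples in columns.
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Fit the same model to one gene and return the p-value of every coefficient.
fit_one_gene <- function(y) {
  fit <- lm(y ~ age + gender + disease + batch + strain, data = pheno)
  coef(summary(fit))[, "Pr(>|t|)"]
}

# Rows are genes, columns are model terms.
pvals <- t(apply(expr, 1, fit_one_gene))

# "diseasecase" is the case-versus-control coefficient.
disease_p <- pvals[, "diseasecase"]
print(head(disease_p))
```

The other columns of `pvals` (for example `batchb2` or `strains2`) hold the per-gene p-values for the remaining covariates.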

For this model I get, for each gene, a p-value for each of the variables above (i.e. a p-value for age, gender, disease state, and so on). Since I am interested in pathway enrichment for disease state, I take the p-values for that variable and use these per-gene p-values as the input to the KS test. For pathway enrichment using KEGG and the KS test, for each pathway I extract the p-values for the genes in the pathway and the p-values for the genes not in the pathway (see below for pseudocode):

```
for each KEGG pathway {
  ks_F <- ks.test(x = PVALUES OF GENES IN A KEGG PATHWAY,
                  y = PVALUES OF GENES NOT IN A KEGG PATHWAY,
                  alternative = "greater")
}
```
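The pseudocode above could be made runnable roughly as follows. This is only a sketch on made-up data: `pvalues` and `pathways` are hypothetical stand-ins for the per-gene p-values and the KEGG gene sets, and one toy pathway is deliberately given small p-values so there is something to detect.

```r
set.seed(42)

# Hypothetical per-gene p-values for one model term (e.g. disease state).
gene_ids <- paste0("gene", 1:500)
pvalues <- setNames(runif(length(gene_ids)), gene_ids)

# Toy gene sets standing in for KEGG pathways (not real annotations).
pathways <- list(
  pathwayA = gene_ids[1:40],
  pathwayB = gene_ids[41:100]
)

# Make pathwayA look enriched by shrinking its p-values towards zero.
pvalues[pathways$pathwayA] <- pvalues[pathways$pathwayA]^4

# One KS test per pathway: genes in the set versus all other genes.
ks_p <- sapply(pathways, function(genes) {
  in_set <- names(pvalues) %in% genes
  ks.test(x = pvalues[in_set], y = pvalues[!in_set],
          alternative = "greater")$p.value
})

print(ks_p)
```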

First question: the default for the KS test is two-sided, but I don't think this is correct here, since we are only interested in a one-sided test, so it needs to be "less" or "greater". Is this correct? If so, how do I work out which side I need for the KS test, and does it depend on the dataset I am analysing?
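One way to check the direction empirically is a small simulation. According to the R documentation for `ks.test`, in the two-sample case `alternative = "greater"` corresponds to the CDF of `x` lying above that of `y`, i.e. `x` tending to be smaller than `y`. The sketch below uses simulated p-values only (`in_pathway` and `background` are made-up names) and compares the three settings when the pathway p-values are skewed towards zero.

```r
set.seed(7)

in_pathway <- runif(50)^3    # skewed towards 0: an "enriched" gene set
background <- runif(2000)    # uniform: no signal

sides <- c("two.sided", "less", "greater")
p_by_side <- sapply(sides, function(side) {
  ks.test(x = in_pathway, y = background, alternative = side)$p.value
})

print(signif(p_by_side, 3))
```

In this setup the small pathway p-values should be picked up by `"greater"` and not by `"less"`, which suggests the side depends on which sample is passed as `x` and on whether small or large values are of interest, rather than on the particular dataset.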

The pathway enrichment then gives me a list of KS p-values, one per KEGG pathway, using the pseudocode above. For example's sake, say I get significant enrichment for the Alzheimer's and Parkinson's KEGG pathways; this tells me which pathways are enriched with respect to disease state. HOWEVER, if I repeat the entire process using the per-gene p-values from another variable in the model, such as batch or strain, some of the significantly enriched pathways are the same as those for disease state. So my second question is: how is this possible? Why do the p-values for strain or batch give me some of the same significantly enriched pathways as the disease state variable, which is the one I actually want to test?

I suppose not a lot of people extract the p-values for the other covariates in the model and put them through pathway enrichment, but I would like to understand why this occurs and whether there is any way of compensating for it. To figure this out, I am planning three checks. First, add a binary variable that is unrelated to any of the others, repeat the enrichment, and see whether the same significant pathways come up. Second, shuffle some of the variables and repeat. Finally, replace my gene expression matrix with simulated random values that have the same distribution as my original dataset. If anyone has any other suggestions, please let me know!
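The first two checks could be sketched as below. Everything here is simulated and hypothetical (`expr`, `disease`, `pathway_genes`, `random_flag`), the model is reduced to two terms to keep it short, and base R `lm` is used in place of limma, so this shows only the shape of the check, not the real analysis.

```r
set.seed(123)
n_samples <- 30
n_genes <- 200

# Simulated stand-ins for the real data (all names are hypothetical).
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
disease <- factor(rep(c("control", "case"), each = n_samples / 2))
pathway_genes <- rownames(expr)[1:20]  # toy gene set, not a real KEGG pathway

# Per-gene p-value for `term` in the model expression ~ disease + term.
term_pvalues <- function(term) {
  apply(expr, 1, function(y) {
    coef(summary(lm(y ~ disease + term)))["term", "Pr(>|t|)"]
  })
}

# KS enrichment of the toy pathway for a vector of per-gene p-values.
ks_enrichment <- function(p) {
  in_set <- names(p) %in% pathway_genes
  ks.test(p[in_set], p[!in_set], alternative = "greater")$p.value
}

# Check 1: a random binary covariate unrelated to everything else.
random_flag <- rbinom(n_samples, size = 1, prob = 0.5)
p_random <- ks_enrichment(term_pvalues(random_flag))

# Check 2: reshuffle that covariate several times to see how the
# pathway-level KS p-value behaves when there is no real association.
p_shuffled <- replicate(10, ks_enrichment(term_pvalues(sample(random_flag))))

print(p_random)
print(summary(p_shuffled))
```

If the real expression matrix is substituted for `expr` and the shuffled covariate still produces small pathway p-values for the same pathways, that would point towards structure in the expression data itself rather than the covariate being tested.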

Thanks for any help

Danielle