Due to the fact that the outcome variables are categorical, you would have to use glm() and perform binomial/binary or multinomial logistic regression. You would essentially be testing the hypothesis that the expression of your genes of interest are related to the outcome(s) of interest.

# First check that your categorical variables are ordered correctly

Convert your variables to factors with `factor()`

and ensure that you choose the reference ('base') category. For example:

```
factor(MyData$TumourStage, levels=c("StageI", "StageII", "StageIII", "StageIV"))
```

Here, StageI will be assigned as the reference category automatically.
You can also set the reference category specifically with:

```
relevel(MyData$TumourStage, ref="StageI"))
```

# Test each gene independently

For example, you could initially test each gene independently.

```
glm(TumourStage ~ gene1, data=MyData, family=binomial)
glm(TumourStage ~ gene2, data=MyData, family=binomial)
...
glm(TumourStage ~ geneX, data=MyData, family=binomial)
```

You can then check the coefficients, derive odds ratios (ORs), and also a P-value by applying Chi-square ANOVA to the model:

```
mymodel <- glm(TumourStage ~ gene1, data=MyData, family=binomial)
summary(mymodel)
exp(cbind(OR=coef(mymodel), confint(mymodel))) #ORs
anova(mymodel, test="Chisq") #Chi-square ANOVA
```

# Build predictive models using multiple genes

You could also build predictive models and choose the best predictors through step-wise regression:

```
null <- glm(TumourStage ~ 1, data=MyData, family=binomial)
full <- glm(TumourStage ~ gene1 + gene2 + ... + geneX, data=MyData, family=binomial)
require(MASS)
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward") #forward stepwise regression
backward <- stepAIC(full, direction="backward") #backward stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both") #forward + backward stepwise regression
```

Then, check each of these models to see which one is best. The interpretation of the word 'best', here, is very much open for discussion! One way to test each model is through cross-validation and ROC analysis

# Cross-validation analysis

Here, 'K' is the number of chunks into which you will divide your dataset, and also the number of times that you will repeat the cross-validation. A value of 3 is common, but here I use 10-fold. K can also equal the total number of samples in the dataset, in which case the type of cross-validation is known as 'leave-one-out' cross-validation.

```
require(boot)
cv.glm(MyData, full, K=10)$delta
cv.glm(MyData, both, K=10)$delta
cv.glm(MyData, forward, K=10)$delta
cv.glm(MyData, backward, K=10)$delta
```

Then, plot deltae from the different models. The smallest difference and lowest values of deltae may represent the 'best' model

# ROC analysis

```
ROC analysis with 95% confidence intervals
require(pROC)
roc <- roc(MyData$TumourStage, fitted(full), smooth=FALSE)
plot.roc(roc, grid=TRUE, grid.lwd=2)
text(0.5, 0.0, paste("AUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.8)
roc <- roc(MyData$TumourStage, fitted(both), smooth=TRUE)
plot.roc(roc, grid=TRUE, grid.lwd=2)
text(0.5, 0.0, paste("AUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.8)
et cetera.
```

# Perform lasso-penalised, elastic-net, or ridge regression (if many predictors)

See my answers on these Biostars threads:

Dear Kevin,

nice to see you again on a different but very interesting topic !! Indeed, your considerations and approaches are very comprehensive and detailed. To focus on the matter:

My gene signature mentioned, has been produced from complex linear models with limma, with additional feature selection developed on our lab (independent on gene expression, based on functional enrichment). So it disciminates very efficiently primary specific cancers from adjucent controls. However, except this, i would like to investigate, if the same signature has also any putative prognostic implication. The major issue here, is the only 30 samples, but i could also test the same genes on bigger TCGA cohorts. Thus, based on your suggestions:

1) your idea about constructing a glm of each gene independently, seems very nice, especially if i could implement a for loop or something like a function to construct this automatically. If i understand your approach here with the p-value: as for instance, stage I is the reference level about the putative association of one gene with Tumor Stage, a low chisquare p-value (i.e. less than 0.05), could indicate a difference of any of the rest levels with level I ? or more conviniently in my case, as i have two groups: StageI/II and Stage III/IV ? or has a different interpretation ?

2) Perhaps, some exploratory boxplots could also be a nice addition, but of course would not have the significance of a test of association, correct ?

3) elastic net is great, especially with cv.glmnet to find informative predictors-perhaps in this case, you probably mean to use a cross-validation with cv.glmnet() to find an optimal λ, and then with alpha=0 see if these 94 genes could be associated with Tumor stage for example ? As again i use only my cancer samples ?

Best,

Efstathios

Hi Efstathios,

1) If you only have StageI/II and Stage III/IV (with StageI/II as the reference level) then testing each gene independently via

`glm()`

will essentially be binary logistic regression, and a significant P value from Chi-squared ANOVA will just indicate that the expression of the gene is significantly associated with StageIII/IV. You will have to check the model estimate via the`summary()`

function in order to determine if the gene's expression is increased or decreased in relation to StageIII/IV.It is possible to build a for loop in R to do this. You first put your genes into a vector and then use a combination of

`paste()`

and`as.formula()`

, something like this:2) Boxplots would also be great. These will help you to see if any genes are outliers. The basic

`boxplot()`

function in R is good but you can generate nicer boxplots usingggplot2.3) Considering that you have 94 genes, you should do lasso-panalized regression in place of stepwise regression. Stepwise regression is only suitable for ~30 predictors, whereas lasso can tolerate 100s or 1000s. Yes, I mean,

`cv.glmnet()`

, which essentially just calls`glmnet()`

multiple times when it's doing the cross validation. I would start with the lasso panalty (alpha=1) and then, if you struggle to find predictors, use the elastic-net penalty at alpha=0.5, followed by the ridge panalty (alpha=0).Your sample number is indeed very low. Validating your on the TCGA data is a great idea.

Kevin

Dear Kevin,

please excuse me to reply with delay, but due to many other tasks just before now i had time to read carefully your answer. Firstly, thank you again for your suggestion !! The for loop seems something similar i had i mind. Just a small confirmation on this:

1) the object "Mydata", essentially would be a data frame including in the columns all the 94 genes and the 1 categorical variable of interest-for instance Tumor Stage I/II & III/IV levels, correct ? and the samples in the rows ? Moreover, i should add also the chunk code your proposed above for the chi-square ANOVA based on the results of glm ? or it is intended for more categorical levels ?

2) Concerning the boxplots, i came up with an idea about multivariable boxplots:

but for 94 genes, it might be not suitable for the same image plot to illustrate them.

3) Lastly, i came up with a very interesting question, that might suit our current discussion: i have also aquired 8 continuous clinical parameters, measured on the same patients (both cancer and normal samples) , and i would like to identify any interesting correlation patterns, between my gene signature discussed above, and these clinical parameters group. However, my main problem is the following:

as my gene expression microarray data have already been preprocessed/normalized (rma in R, log2, etc), my clinical variables are continuous but however not normalized/transformed. Moreover, some of the 8 variables in total are highly skewed towards zero, as also having different ranges and "outlier values". For example the range from one variable is from 0.002518 to 27.300000, whereas for another from 0.971517 to 1.432967.

Based on the above, i tried a simple Spearman correlation analysis (with no any transformation-based on the total samples with also corresponding p-values), as Spearman is robust to outliers and non-parametric, with which i have identified some modest but interesting correlations. However, if i would like to inspect with another more "formal", statistical approach, a more significant proof of any association/causation between any of the clinical vars with any of the genes, how should i proceed in your opinion ? I should again utilize some categorical variable ? My "overall aim" for this, is then to "naively" suggest for instance, to predict an increase from the (expression)values of a clinical variable, the increase in expression of a correlated gene, if i could describe it formally.

Best,

Efstathios

Hi friend,

1) Yes, the data-frame MyData will contain a single column for the categorical variable, and then 94 columns for the genes. The samples are rows. In each loop in the for loop, you call a different gene using

`genelist[i]`

or`colnames(MyData)[i]`

. You can add any amount of code to each loop including the ANOVA.2) I answer this below!

3) If you have continuous variables, you can attempt to fit a statistical regression model to these variables, again by using the genes as predictors (and some function, such as stepwise regression, to pick the best predictors). The type of model depends on the distribution of these variables. If you believe that they follow a linear relationship to gene expression, then just fit a model with

`lm()`

. However, if they have some non-liner relationship, you may fit a`glm()`

and choose a specific family such as inverse Gaussian or Poisson (see here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html). You should plot each continuous variable in order to see the trend they follow. LOESS is another possibility (fitted with`loess()`

)2) For boxplots, yes, there is no easy way to show 94 genes at the same time! I think that you can plot multiple boxplots on the same print sheet, though. In my recent publication from Boston (HERE), I plotted 30 boxplots on the same sheet, again using a loop and my own function:

Here,

ScatterMatrixis the same asMyData, with the first column being the categorical variable ('Module') in my case.Dear Kevin,

thank you again for your excellent code regarding the boxplots and valuable insights for this matter !! Firstly, i will focus on my last part about the clinical continuous variables and gene signature, and hope i have understood your general approach and might try something meaningfull:

Above all, my final goal (which is far distinct from the usual data analysis and functional enrichment), is except the initial putative correlation patterns based on my whole dataset, especially for only the cancer samples, be able to detect (if possible) if i could from any of my non-invasive clinical continuous parameters, "predict'' the change in expression levels of any of the genes in my gene signature (or being ably to quantify a significant relationship, far above a simple correlation).

Thus in your opinion:

1) As the 8 clinical quantitative variables are non-normalized-raw continuous values-, and with all the issues discussed above (skewed values, outliers, etc), a glm model would be more appropriate to catch up also non-linear relationships? Also, based on the characteristics of these variables, there will probably be violations in the linear regresssion assumptions

2) Based on the distribution and to plot each continuous variable: you meant, to just plot the values of each of the 8 continuous variables, to inspect the distribution ? Or something other, as a scatterplot of pairwise genes with the clinical variables ?

3) Also, in my case, as i would like based on the clinical variables (non-invasive) to somehow "predict" or quantify the change in any relevant genes, i should use the clinical variables as the independent predictors, and the genes as dependent ? For instance,

dd <- lm(CD44~ClinicalVar1, data=dat) ?

4) In addition to the previous question-my last question perhaps is the most challenging: as i have "multiple" predictors, as also "dependent" continuous variables, i came up with an idea: it would be interesting, to find any significant association with the above statistical regression approaches, in specific "subtypes" of my cancer samples. The only categories that i have, could be two groups: T1/T2 vs T3/T4 samples regarding Tumor Stage, as also N0 vs N1/N2 regarding lymph node metastasis. Thus, despite the relatively small sample size, do you believe that it would be feasible to somehow quantify a "signficicant" relationship of any of these genes with any of these clinical variables, for instance in one of the above groups and not i the other ? As this would be a very interesting ultimate goal for diagnostic senarios, etc ? And if so, how could i proceed with this ? Subset my data into the relevant two sample classes, but then how could i implement a glm model ?

Please excuse me for any naive questions, and if my long message has disturbed you, but based on our very interesting conversation, this scenario came up; so, it would be quite exciting to test it, even the small sample size, to detect any interesting points for consideration !! Unfortunately, i have mainly used limma/edgeR, etc, and not approaches like the above, that's why my detailed question !!

Best,

Efstathios

Hi friend, this is a difficult but important decision. There is no right or wrong answer.

`lm()`

) but remove any outliers that go against the linear assumption. If you include outliers, they may severely bias the model assumptions.`nls()`

).`loess()`

From my own perspective

`glm()`

has better utility when you are comparing categorical variables, like in binomial logistic regression or multinomial logistic regression! The final decision has to come from you and you will have to provide evidence of why you made the decision. I can only guide you here.The box and scatter plots are more for comparing genes between two categories (like, case/control). It would be beneficial to just see a simple plot of the continuous variables in order to see their distribution, like using

`hist()`

or`plot()`

. The`summary()`

function also provides useful information that statisticians use to check for outliers.In this example, you are using the clinical variable to predict the values of the expression of

CD44- that seems to indeed be what you want. If, however, you want to use gene expression to predict the values of the clinical variable, then it should be:`dd <- lm(ClinicalVar1 ~ CD44, data=data)`

(you can also add extra predictors as`lm(ClinicalVar1 ~ CD44 + CD56 + CD20)`

)In this situation, you will have to use a multinomial logistic regression model with the glm() function, and you will have to factorise the tumour stage and lymph node variables with something like:

`stage <- factor(stage, levels=c("T1/T2","T3/T4"))`

`lymphnodes <- factor(lymphnodes, levels=c("N0","N1/N2"))`

This places the less severe versions of these as the reference/baseline levels, against which the more severe versions will be compared in the model. It is also possible to combine these as as single categorical variable using

`paste()`

, although this may severely reduce your sample n in each category.To then build the model, you could use something like:

Hope that this helps! - efharisto!

Kevin

Dear Kevin, -----one more time thank you for your consideration on this matter-----

in order to recap and hopefully to see if i have understood your approach:

Regarding 1 & 2: i have created some initial histograms and density plots for each of the 8 continuous variables: the most common pattern of these distributions, is that there are mostly skewed (for istance towards zero), and with some extreme values-possibly outliers, and they also have different ranges for this. Thus, this was my issue for using lm, because of the normality assumptions etc. Of course, there are various ways of "putative" normalization, but it could be very "difficult" to justify one or another. Perhaps, i could scale all my needed features-that is, both my normalized gene expression and the PET variables-and then use for instance linear models. Also, as you said there is certainly no "right" answer for this !For 3 and 4 to make a combination: as you have mentioned, i would like to inspect if from any of the continuous clinical variables, i could "predict" the change in expression in any of the genes in my signature. But this, would mostly be appropriate to check if "happens" in different groups of cancer samples :for instance, any of the clinical vars have a "significant" association and been able to predict the alteration in any of the genes in the N1/N2 samples, but not in the N0 group ---this would be the final essential point to test. Thus, based on your suggestions:

in able to catch more general associations, if i would like to use glm instead of lm, and test the above hypothesis like the following:

and this would be an example for only a specific gene ? while i can add all the PET variables, or for instance only the ones that showed a positive correlation ?

this, could somehow implemented with glm ? or with glm (and please correct me if i have understood wrong), you only test if the gene or clinical variable in each case is associated with any of the categorical levels ? but not of course any linkage between genes and clinical variables ?

Finally, if my notion is correct and for the genes/clinical vars association, i could use something alternative for lm, without need for normality assumptions/and only linear associations, which would be the most feasible option ?

Best,

Efstathios-Iason

Hi Efstathios,

For the 8 variables that you have, it sounds like the distribution is too skewed such that they could be used for

`lm()`

in their current form. What happens when you log these variabes? - does that produce a more linear distribution? If the distribution is too skewed right now, then you could possibly convert these cintinuous variables into categorical variables, like:...then compare these in a multinomial gistic regression model (

`glm()`

).Yes, you can do it this way, i.e., separating by group (N0 and N1/N2). The important thing is that, whichever way you do it, you just have to ensure that you understand how to interpret the result based on the way that you do it. You can include all of the variables in these models and then check them with the

`summary()`

function, e.g.,`summary(dd1); summary(dd2)`

, and then remove any variables that are not statistically significant. At the end, generally, you should only include the statistically significant variables in the model. For N0 and N1/N2, I would expect the final models to have different variables.After you have decided on a final model, you can check the validity of the model through cross-validation analysis, and also derive odds ratio. Take a look at my answer here: C: Resources for gene signature creation

Kind regards, Kevin

Hi Kevin,

A) Firstly, regarding the nature and characteristics of the 8 continuous clinical variables: below i have created some density plots of the most variables in raw scale and in log2 transformation-unfortunately, whereas in some variables, the distribution curve becomes more "bell-shaped", in the majority of the others, due to a plethora of values near zero, after log2 transform many and large negative values appear.

i thought in parallel other kinds of normalization, like box-cox transformation, but again im not certain if it is more "appropriate" than to leave them on raw scale, and perhaps try something else. For instance, also i post below an additional qqplot of one clinical variable with raw values and after boxcox transfrom:

https://www.dropbox.com/s/vmgmv5gq0wspu6j/plot.density.pet.var2.png?dl=0 https://www.dropbox.com/s/and437c5q7ka82l/plot.density.pet.var3.png?dl=0 https://www.dropbox.com/s/fotcftthqgdzbtu/plot.density.pet.var4.png?dl=0 https://www.dropbox.com/s/zfgnbopczu28ahq/plot.density.pet.var5.png?dl=0 https://www.dropbox.com/s/ps7tro3her7jow1/plot.density.var1.png?dl=0 https://www.dropbox.com/s/76es3hatfx8l2h4/var8.inspect.png?dl=0 https://www.dropbox.com/s/9sk9hpd6bbaunaj/qqplot.suv.boxcox.transform.png?dl=0

B) Thus, to summarize in relation to the above part, after these exploratory plots, what is your opinion ? I could use lm(), but for instance standardize (center and scale) all my used continuous variables ? That is for both clinical & gene expression data ? Or generally due to the assumptions of lm, another model would be more valid to use ?

For example, in case i would use glm, like lm, how i could appropriately implement it:

as for instance:

## whereas, for instance:

Or i should formulate the glm formula above differently for my purpose discussed above ?

Please excuse me for my last question, but this distinction is a bit confusing, and need to be certain in order to move to further steps, such as odds ratio, etc

Best,

Efstathios-Iason

Hello, based on that information (i.e. after having seen the distributions of the variables), I would only use a generalised linear model (

`glm()`

), and I would only this with the unlogged variables and the following parameter:`family="Gamma"(link="log")`

For example:

Then choose the best variables from the

`summary()`

function applied to the model and re-run the model.If you wish to create a model with your clinical variables and a gene (e.g.

CD44), then also use`glm()`

:The only situation where you could really justify using the

`lm()`

with your gene is if the ClinicalVariable was categorised, such as 'LessThanZero' and 'GreaterThanZero'.Remember that there is no correct way here! Someone will always find a way to criticise your work at every possible chance they can get.

Hello Kevin, thank you for your further clarifications-thus in order to finally summarize my thoughts, and being sure that i understood correctly your examples, specifically for the code implementation-Thus:

1) In any of the above cases, it is "safer" to use glm but with distinct purposes. In other words:

A)

`With glm(stage ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"),data=data):`

## with the above, i essentially use all my available dataset with the categorical variable stage, and inspect if there is any significant association with any of the two groups with any of the clinical variables, correct ? and could further move with odds ratio, etc

B) Whereas with:

`data1 <- glm(CD44 ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"),data=data1)`

## as also for the

data argument,i should separate my data frame into 2 dataframes, as discussed above ? and also with data=data2, correct ? And this would show a putative association of any of the clinical vars to predict the value of a speficic gene, correct ? with interest in differences between the two groups ?Finally, you last implementation with lm might be also performed, by example fitting a median threshold in the values of each variable, and perform the same approach.

Hey Efstathios,

Yes, that is a 'yes' to all of your questions, including the dividing of the data into data1 (N0) and data2 (N1/N2) when using the

CD44model.Just remember, too, that testing each clinical variable independently is different from testing them together. You can do both types and check the results.

For example:

This asks:

Is there an association between ClinicalVar1 and tumour stage?, i.e., can we use ClinicalVar1 to predict very well the tumour stage.Can the combination of Var1 and Var2 predict very well tumour stage?In this case, if ClinicalVar1 is highly significant on its own but ClinicalVar2 is not, then this model will still be poor. Each variable's information is 'adjusted' based on all other variables in the model. The quickest way to see which variables are good is by using the`summary()`

function. To then check the overall power of the final model, which may have`stage ~ Var1 + Var5 + Var7`

, you have to do cross-validation and ROC analysis, which I mentioned at the very beginning (and also derive the odds ratios).Kevin, thank you for your confirmation-i suppose the above example suits for the implementation of glm with one categorical variable and 1 or more continuous clinical, correct ?

whereas the implementation with both continuous groups, will just identify-if any-signficant associations with any of the vars with the gene right ?

Yes, this:

`glm(CD44 ~ ClinicalVar1, family="Gamma"(link="log"), data=data1)`

is checking for an association between ClinicaVar1 and the expression levels ofCD44(in the N0 groupCan we use ClinicalVar1 to predict the expression of the CD44 gene in N0 patients?). We use Gamma distribution due to the distribution of your ClinicalVar.By the way, if you divide your dataset into N0 and N1/N2, you should also check this model:

`glm(NodesInvolved ~ CD44, data=data)`

. In order to justify the division into N0 and N1/N2 for the CD44 model, you may have to first show that CD44 is statistically significantly different between N0 and N1/N2. This is likely a question from a statistician looking at your work in the future.Yes exactly Kevin,

you got my point before answer it !! (also assume that with NodesInvolved above you meant the categorical variable for Lymph Node status that has the 2 levels discussed)--I could also perhaps implement statistical comparisons with limma and interaction models (etc),-but would be far more complicated for the design matrix, etc-. Thus, this approach as an alternative looks the same valid and more simple. Also, I don't know about any general issues that could be adjusted for confunders as with limma (but this could be another whole topic for discussion)-, but this is one approach straight to the point:

1) I could use your implemented loop for checking significant association between any of the gene signature and either tumor stage groups or lymph nodes:

2) And then, for any significant resulted genes (with additional odds ratio and anova test provided), inspect subsequently any putative association and prediction from any of the pets, as discussed above:

For example:

My extra consern, is if i would have to do any multiple testing correction, but im not sure it applies here in the exact sense like microarrays or RNA-Seq.

Hello, yes, that seems like a good plan (and, 'yes', use

`family=binomial`

in the first part with the loop). If you want to parallelise the loop to make it run quicker, you can take a look at my code here (see 'Any type of LM, GLM, clogit (here: clogit)'): R functions edited for parallel processing (this may be too complex for now, though).Then, yes, you can check the significant genes in the other model ('

2)').You do not need to worry about limma here - I do not believe that limma is a good choice (because limma is fundamentally based on linear models). You neither need to worry about multiple-test corrections; however, as I said before, some person will always try to find a way to criticise - you just need to prepare for that.

Thanks for the additional links- i will now hopefully try to summarize and finally implement all the above exciting things-as for your last comment, of course two famous phrases suit here: "Essentially, all models are wrong, but some are useful", or even " among competing hypotheses, the one with the fewest assumptions should be selected"...

Πολύ καλά! Να με ενημερώνετε για την πρόοδο

Very good! - keep me updated

Just one more thing: you will also likely receive criticism for using the gamma distribution in the second model, but remember that it is due to the distribution of the clinical variables, which is clearly non-normal (not binomial).

Hi Kevin,

again a very crussial comment:

i understand from a small search that this is a generally huge "debate" generally on distribution model selections in regression analysis and generally on statistical implementations, like gene expression. For example, i noticed many people talking about normality assumptions needed only in the error terms, and not the actual normal distributions of variables in linear models.

About the choise of gamma distribution--it is mainly due to the general non-normality of the clinical variables, as also for the skewness noticed in various distributions of these, right ? and binomial in the first place, as the genes are normalized correct ?

but again, in the second model with the gamma distribution, any significant association detected, for instance would still related to ie.:

an increase in one clinical variable, indicate also increase of the specific gene, right ? although not with the same sence of " direct linearity" with linear models ?

Yes, I think that gamma is better due to the skewness of your clinical variables. However, we are regressing these clinical variables against a normal distribution (the gene

CD44).I still believe that Gamma is the best choice here, but you will have to let the results guide you. If you wish, you can do further simple diagnostics by plotting a simple linear fit to visually test the assumption that the association is linear:

This will fit a linear line, but the relationship between x and y may look anything

linear, in which case we can support the choice of Gamma.butProbably,

i will have to test it for each of any significant genes identified in the first step, with each of the clinical variables. And hope the the majority of clinical variables follow the pattern we are discussing. The data will probably show.