First check that your categorical variables are ordered correctly

Question

Possible methodologies for association of specific gene subsets from microarray data with categorical histopathological parameters

0

Entering edit mode

6.5 years ago

svlachavas ▴ 790

Dear Biostars group,

based on a microarray gene expression analysis in R and limma, comparing tumor samples versus controls, a small gene set was identified. In parallel, i would like to inspect if regarding only the cancer samples, any of these genes could be associated/correlated with specific histopathological variables, such as Tumor Stage and Lymph node status. Thus, this could indicate a putative prognostic/predictive impact of any of these genes to putative treatment/monitoring, etc. Because the latter variables are categorical, which approach/methodology could i implement in R ?

For example, i firstly thought about a very "naive" approach:

fit for example for each gene a linear model (lm()), and then perform an ANOVA test (aov()) on the fit to see if the gene is independent regarding any of the tested group levels of each categorical variable. But this would be enough, to indicate "putative correlation" ?

Moreover, i thought about even some initial boxplots comparing the means of the cancer groups for each variable, but again im not certain if this would be enough.

A limitation i must admit (but could be partially "solved" by trying the same approach to an external dataset of the same phenotype) is the relatively small sample size: the number of cancer samples is only 30.

Any suggestions or ideas would be grateful !!

R correlation microarrays qualitative variables • 3.7k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 87k • written 6.5 years ago by svlachavas ▴ 790

0

Entering edit mode

Dear Kevin,

nice to see you again on a different but very interesting topic !! Indeed, your considerations and approaches are very comprehensive and detailed. To focus on the matter:

My gene signature mentioned, has been produced from complex linear models with limma, with additional feature selection developed on our lab (independent on gene expression, based on functional enrichment). So it disciminates very efficiently primary specific cancers from adjucent controls. However, except this, i would like to investigate, if the same signature has also any putative prognostic implication. The major issue here, is the only 30 samples, but i could also test the same genes on bigger TCGA cohorts. Thus, based on your suggestions:

1) your idea about constructing a glm of each gene independently, seems very nice, especially if i could implement a for loop or something like a function to construct this automatically. If i understand your approach here with the p-value: as for instance, stage I is the reference level about the putative association of one gene with Tumor Stage, a low chisquare p-value (i.e. less than 0.05), could indicate a difference of any of the rest levels with level I ? or more conviniently in my case, as i have two groups: StageI/II and Stage III/IV ? or has a different interpretation ?

2) Perhaps, some exploratory boxplots could also be a nice addition, but of course would not have the significance of a test of association, correct ?

3) elastic net is great, especially with cv.glmnet to find informative predictors-perhaps in this case, you probably mean to use a cross-validation with cv.glmnet() to find an optimal λ, and then with alpha=0 see if these 94 genes could be associated with Tumor stage for example ? As again i use only my cancer samples ?

Best,

Efstathios

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Hi Efstathios,

1) If you only have StageI/II and Stage III/IV (with StageI/II as the reference level) then testing each gene independently via glm() will essentially be binary logistic regression, and a significant P value from Chi-squared ANOVA will just indicate that the expression of the gene is significantly associated with StageIII/IV. You will have to check the model estimate via the summary() function in order to determine if the gene's expression is increased or decreased in relation to StageIII/IV.

It is possible to build a for loop in R to do this. You first put your genes into a vector and then use a combination of paste() and as.formula(), something like this:

for (i in 1:length(genelist))
{
   formula <- as.formula(paste("TumourStage ~ ", genelist[i], sep=""))

   model <- glm(formula, data=MyData, family=binomial)

   etc.
}

2) Boxplots would also be great. These will help you to see if any genes are outliers. The basic boxplot() function in R is good but you can generate nicer boxplots using ggplot2.

3) Considering that you have 94 genes, you should do lasso-panalized regression in place of stepwise regression. Stepwise regression is only suitable for ~30 predictors, whereas lasso can tolerate 100s or 1000s. Yes, I mean, cv.glmnet(), which essentially just calls glmnet() multiple times when it's doing the cross validation. I would start with the lasso panalty (alpha=1) and then, if you struggle to find predictors, use the elastic-net penalty at alpha=0.5, followed by the ridge panalty (alpha=0).

Your sample number is indeed very low. Validating your on the TCGA data is a great idea.

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Kevin,

please excuse me to reply with delay, but due to many other tasks just before now i had time to read carefully your answer. Firstly, thank you again for your suggestion !! The for loop seems something similar i had i mind. Just a small confirmation on this:

1) the object "Mydata", essentially would be a data frame including in the columns all the 94 genes and the 1 categorical variable of interest-for instance Tumor Stage I/II & III/IV levels, correct ? and the samples in the rows ? Moreover, i should add also the chunk code your proposed above for the chi-square ANOVA based on the results of glm ? or it is intended for more categorical levels ?

2) Concerning the boxplots, i came up with an idea about multivariable boxplots:

df.m <- melt(final.dat.sel, id.var = "Tumor_Stage") #where df is a data frame including the expression variables of the 94 genes and the 1 categorical variable Tumor_Stage
 p <- ggplot(data=df.m,aes(x=variable,y=value))
 p <- p + geom_boxplot(aes(fill=Tumor_Stage))
 p <- p + geom_point(aes(y=value,group=Tumor_Stage),position=position_dodge(width=0.75))
 p <- p + facet_wrap(~variable,scales="free")
 p <- p + xlab("gene symbol") + ylab("log2 expression") + ggtitle("Differences in Tumor Stage status in CRC")
 p <- p + guides(fill=guide_legend(title="Tumor Stage"))

but for 94 genes, it might be not suitable for the same image plot to illustrate them.

3) Lastly, i came up with a very interesting question, that might suit our current discussion: i have also aquired 8 continuous clinical parameters, measured on the same patients (both cancer and normal samples) , and i would like to identify any interesting correlation patterns, between my gene signature discussed above, and these clinical parameters group. However, my main problem is the following:

as my gene expression microarray data have already been preprocessed/normalized (rma in R, log2, etc), my clinical variables are continuous but however not normalized/transformed. Moreover, some of the 8 variables in total are highly skewed towards zero, as also having different ranges and "outlier values". For example the range from one variable is from 0.002518 to 27.300000, whereas for another from 0.971517 to 1.432967.

Based on the above, i tried a simple Spearman correlation analysis (with no any transformation-based on the total samples with also corresponding p-values), as Spearman is robust to outliers and non-parametric, with which i have identified some modest but interesting correlations. However, if i would like to inspect with another more "formal", statistical approach, a more significant proof of any association/causation between any of the clinical vars with any of the genes, how should i proceed in your opinion ? I should again utilize some categorical variable ? My "overall aim" for this, is then to "naively" suggest for instance, to predict an increase from the (expression)values of a clinical variable, the increase in expression of a correlated gene, if i could describe it formally.

Best,

Efstathios

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Hi friend,

1) Yes, the data-frame MyData will contain a single column for the categorical variable, and then 94 columns for the genes. The samples are rows. In each loop in the for loop, you call a different gene using genelist[i] or colnames(MyData)[i]. You can add any amount of code to each loop including the ANOVA.

2) I answer this below!

3) If you have continuous variables, you can attempt to fit a statistical regression model to these variables, again by using the genes as predictors (and some function, such as stepwise regression, to pick the best predictors). The type of model depends on the distribution of these variables. If you believe that they follow a linear relationship to gene expression, then just fit a model with lm(). However, if they have some non-liner relationship, you may fit a glm() and choose a specific family such as inverse Gaussian or Poisson (see here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html). You should plot each continuous variable in order to see the trend they follow. LOESS is another possibility (fitted with loess())

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

2) For boxplots, yes, there is no easy way to show 94 genes at the same time! I think that you can plot multiple boxplots on the same print sheet, though. In my recent publication from Boston (HERE), I plotted 30 boxplots on the same sheet, again using a loop and my own function:

#Define a custom plotting function
#Arguments:
#   ScatterMatrix, a data-frame of numbers, with one of more additional columns as factors
#   Module, the factor column whose values by which the plotting will segregate
#   i, the column index whose values will be plotted
ggCustom <- function(ScatterMatrix, Module, i)
{
    require(ggplot2)

    g <- ggplot(ScatterMatrix, aes(x=Module, y=as.numeric(ScatterMatrix[,i]))) +

        #Add a boxplot
        #Control how outliers are managed using extra parameters
        geom_boxplot(position=position_dodge(width=0.5), outlier.shape=17, outlier.colour="red", outlier.size=0.1, aes(fill=Module)) +

        stat_summary(geom="crossbar", width=0.8, fatten=2, color="black", fun.data=function(x){return(c(y=median(x), ymin=median(x), ymax=median(x)))}) +

        #Choose which colours to use; otherwise, ggplot2 choose automatically
        #scale_color_manual(values=c("red3", "white", "blue")) + #for scatter plot dots
        #scale_fill_manual(values=c("royalblue", "pink", "red4")) + #for boxplot
        scale_fill_grey(start=0.5, end=0.99) +

        #Add the scatter points (treats outliers same as 'inliers')
        #geom_jitter(position=position_jitter(width=0.3), size=0.25, colour="black") +

        #Set the size of the plotting window
        theme_bw(base_size=24) +

        #Modify various aspects of the plot text and legend
        theme(
            legend.position="none",
            legend.background=element_rect(),
            plot.title=element_text(angle=0, size=10, face="bold", vjust=1),

            axis.text.x=element_text(angle=0, size=12, face="bold", vjust=0.5),
            axis.text.y=element_text(angle=0, size=12, face="bold", vjust=0.5),
            axis.title=element_text(size=12, face="bold"),

            #Legend
            legend.key=element_blank(),     #removes the border
            legend.key.size=unit(1, "cm"),  #Sets overall area/size of the legend
            legend.text=element_text(size=12),  #Text size
            title=element_text(size=12)) +      #Title text size

        #Change the size of the icons/symbols in the legend
        guides(colour=guide_legend(override.aes=list(size=2.5))) +

        #Set x- and y-axes labels
        xlab("Cluster group") +
        ylab("Log metabolite level") +

        ylim(min(ScatterMatrix[,i])-0.4, max(ScatterMatrix[,i])+1.1) +

        ggtitle(colnames(ScatterMatrix)[i])
    return(g)
}


#Create a loop to invoke the custom plotting function over each metabolite and store the plots in a list
plotlist <- list()
for (i in 2:ncol(ScatterMatrix))
{
    plotlist[[i-1]] <- ggCustom(ScatterMatrix, "Module", i)
}

pdf("Signature.BoxScatterPlot.pdf", width=14, height=9)
    require(cowplot)

    plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], plotlist[[4]], plotlist[[5]], plotlist[[6]], plotlist[[7]], plotlist[[8]], plotlist[[9]], plotlist[[10]], plotlist[[11]], plotlist[[12]], plotlist[[13]], plotlist[[14]], plotlist[[15]], plotlist[[16]], plotlist[[17]], plotlist[[18]], labels=c("A"), label_size=28, ncol=6, nrow=3)
dev.off()

Here, ScatterMatrix is the same as MyData, with the first column being the categorical variable ('Module') in my case.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Kevin,

thank you again for your excellent code regarding the boxplots and valuable insights for this matter !! Firstly, i will focus on my last part about the clinical continuous variables and gene signature, and hope i have understood your general approach and might try something meaningfull:

Above all, my final goal (which is far distinct from the usual data analysis and functional enrichment), is except the initial putative correlation patterns based on my whole dataset, especially for only the cancer samples, be able to detect (if possible) if i could from any of my non-invasive clinical continuous parameters, "predict'' the change in expression levels of any of the genes in my gene signature (or being ably to quantify a significant relationship, far above a simple correlation).

Thus in your opinion:

1) As the 8 clinical quantitative variables are non-normalized-raw continuous values-, and with all the issues discussed above (skewed values, outliers, etc), a glm model would be more appropriate to catch up also non-linear relationships? Also, based on the characteristics of these variables, there will probably be violations in the linear regresssion assumptions

2) Based on the distribution and to plot each continuous variable: you meant, to just plot the values of each of the 8 continuous variables, to inspect the distribution ? Or something other, as a scatterplot of pairwise genes with the clinical variables ?

3) Also, in my case, as i would like based on the clinical variables (non-invasive) to somehow "predict" or quantify the change in any relevant genes, i should use the clinical variables as the independent predictors, and the genes as dependent ? For instance,

dd <- lm(CD44~ClinicalVar1, data=dat) ?

4) In addition to the previous question-my last question perhaps is the most challenging: as i have "multiple" predictors, as also "dependent" continuous variables, i came up with an idea: it would be interesting, to find any significant association with the above statistical regression approaches, in specific "subtypes" of my cancer samples. The only categories that i have, could be two groups: T1/T2 vs T3/T4 samples regarding Tumor Stage, as also N0 vs N1/N2 regarding lymph node metastasis. Thus, despite the relatively small sample size, do you believe that it would be feasible to somehow quantify a "signficicant" relationship of any of these genes with any of these clinical variables, for instance in one of the above groups and not i the other ? As this would be a very interesting ultimate goal for diagnostic senarios, etc ? And if so, how could i proceed with this ? Subset my data into the relevant two sample classes, but then how could i implement a glm model ?

Please excuse me for any naive questions, and if my long message has disturbed you, but based on our very interesting conversation, this scenario came up; so, it would be quite exciting to test it, even the small sample size, to detect any interesting points for consideration !! Unfortunately, i have mainly used limma/edgeR, etc, and not approaches like the above, that's why my detailed question !!

Best,

Efstathios

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

0

Entering edit mode

1) As the 8 clinical quantitative variables are non-normalized-raw continuous values-, and with all the issues discussed above (skewed values, outliers, etc), a glm model would be more appropriate to catch up also non-linear relationships? Also, based on the characteristics of these variables, there will probably be violations in the linear regresssion assumptions

Hi friend, this is a difficult but important decision. There is no right or wrong answer.

Linear: If you believe that the relationship between gene expression and your continuous variables should be linear, then you should use a linear least squares regression model (lm()) but remove any outliers that go against the linear assumption. If you include outliers, they may severely bias the model assumptions.
Non-linear: if you make the assumption that the relationship between gene expression and your continuous variables is non-linear, then you should use non-linear least squares regression model (nls()).
Another alternative for non-linear relationships is to fit a local regression on subsets of your data using loess()

From my own perspective glm() has better utility when you are comparing categorical variables, like in binomial logistic regression or multinomial logistic regression! The final decision has to come from you and you will have to provide evidence of why you made the decision. I can only guide you here.

2) Based on the distribution and to plot each continuous variable: you meant, to just plot the values of each of the 8 continuous variables, to inspect the distribution ? Or something other, as a scatterplot of pairwise genes with the clinical variables ?

The box and scatter plots are more for comparing genes between two categories (like, case/control). It would be beneficial to just see a simple plot of the continuous variables in order to see their distribution, like using hist() or plot(). The summary() function also provides useful information that statisticians use to check for outliers.

3) Also, in my case, as i would like based on the clinical variables (non-invasive) to somehow "predict" or quantify the change in any relevant genes, i should use the clinical variables as the independent predictors, and the genes as dependent ? For instance,

dd <- lm(CD44~ClinicalVar1, data=dat) ?

In this example, you are using the clinical variable to predict the values of the expression of CD44 - that seems to indeed be what you want. If, however, you want to use gene expression to predict the values of the clinical variable, then it should be: dd <- lm(ClinicalVar1 ~ CD44, data=data) (you can also add extra predictors as lm(ClinicalVar1 ~ CD44 + CD56 + CD20))

4) In addition to the previous question-my last question perhaps is the most challenging: as i have "multiple" predictors, as also "dependent" continuous variables, i came up with an idea: it would be interesting, to find any significant association with the above statistical regression approaches, in specific "subtypes" of my cancer samples. The only categories that i have, could be two groups: T1/T2 vs T3/T4 samples regarding Tumor Stage, as also N0 vs N1/N2 regarding lymph node metastasis. Thus, despite the relatively small sample size, do you believe that it would be feasible to somehow quantify a "signficicant" relationship of any of these genes with any of these clinical variables, for instance in one of the above groups and not i the other ? As this would be a very interesting ultimate goal for diagnostic senarios, etc ? And if so, how could i proceed with this ? Subset my data into the relevant two sample classes, but then how could i implement a glm model ?

In this situation, you will have to use a multinomial logistic regression model with the glm() function, and you will have to factorise the tumour stage and lymph node variables with something like:

stage <- factor(stage, levels=c("T1/T2","T3/T4"))
lymphnodes <- factor(lymphnodes, levels=c("N0","N1/N2"))

This places the less severe versions of these as the reference/baseline levels, against which the more severe versions will be compared in the model. It is also possible to combine these as as single categorical variable using paste(), although this may severely reduce your sample n in each category.

To then build the model, you could use something like:

glm(stage ~ ClinicalVar1, family="binomial", link="logit")
glm(stage ~ CD20 + CD44, family="binomial", link="logit")

Hope that this helps! - efharisto!

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Kevin, -----one more time thank you for your consideration on this matter-----

in order to recap and hopefully to see if i have understood your approach:

Regarding 1 & 2: i have created some initial histograms and density plots for each of the 8 continuous variables: the most common pattern of these distributions, is that there are mostly skewed (for istance towards zero), and with some extreme values-possibly outliers, and they also have different ranges for this. Thus, this was my issue for using lm, because of the normality assumptions etc. Of course, there are various ways of "putative" normalization, but it could be very "difficult" to justify one or another. Perhaps, i could scale all my needed features-that is, both my normalized gene expression and the PET variables-and then use for instance linear models. Also, as you said there is certainly no "right" answer for this !

For 3 and 4 to make a combination : as you have mentioned, i would like to inspect if from any of the continuous clinical variables, i could "predict" the change in expression in any of the genes in my signature. But this, would mostly be appropriate to check if "happens" in different groups of cancer samples :

for instance, any of the clinical vars have a "significant" association and been able to predict the alteration in any of the genes in the N1/N2 samples, but not in the N0 group ---this would be the final essential point to test. Thus, based on your suggestions:

in able to catch more general associations, if i would like to use glm instead of lm, and test the above hypothesis like the following:

dd1 <- lm(CD44~Clinical Var1 +Clinical Var2..+8, data=data1) # the N1/N2 group

and dd2 <- lm(CD44~Clinical Var1 +Clinical Var2..+8, data=data2) # the N0 group

and this would be an example for only a specific gene ? while i can add all the PET variables, or for instance only the ones that showed a positive correlation ?

this, could somehow implemented with glm ? or with glm (and please correct me if i have understood wrong), you only test if the gene or clinical variable in each case is associated with any of the categorical levels ? but not of course any linkage between genes and clinical variables ?

Finally, if my notion is correct and for the genes/clinical vars association, i could use something alternative for lm, without need for normality assumptions/and only linear associations, which would be the most feasible option ?

Best,

Efstathios-Iason

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

0

Entering edit mode

Hi Efstathios,

For the 8 variables that you have, it sounds like the distribution is too skewed such that they could be used for lm() in their current form. What happens when you log these variabes? - does that produce a more linear distribution? If the distribution is too skewed right now, then you could possibly convert these cintinuous variables into categorical variables, like:

<=5
<=10
<=50
<=100

...then compare these in a multinomial gistic regression model (glm()).

in able to catch more general associations, if i would like to use glm instead of lm, and test the above hypothesis like the following:

dd1 <- lm(CD44~Clinical Var1 +Clinical Var2..+8, data=data1) # the N1/N2 group

and dd2 <- lm(CD44~Clinical Var1 +Clinical Var2..+8, data=data2) # the N0 group

and this would be an example for only a specific gene ? while i can add all the PET variables, or for instance only the ones that showed a positive correlation ?

Yes, you can do it this way, i.e., separating by group (N0 and N1/N2). The important thing is that, whichever way you do it, you just have to ensure that you understand how to interpret the result based on the way that you do it. You can include all of the variables in these models and then check them with the summary() function, e.g., summary(dd1); summary(dd2), and then remove any variables that are not statistically significant. At the end, generally, you should only include the statistically significant variables in the model. For N0 and N1/N2, I would expect the final models to have different variables.

After you have decided on a final model, you can check the validity of the model through cross-validation analysis, and also derive odds ratio. Take a look at my answer here: C: Resources for gene signature creation

Kind regards, Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

A) Firstly, regarding the nature and characteristics of the 8 continuous clinical variables: below i have created some density plots of the most variables in raw scale and in log2 transformation-unfortunately, whereas in some variables, the distribution curve becomes more "bell-shaped", in the majority of the others, due to a plethora of values near zero, after log2 transform many and large negative values appear.

i thought in parallel other kinds of normalization, like box-cox transformation, but again im not certain if it is more "appropriate" than to leave them on raw scale, and perhaps try something else. For instance, also i post below an additional qqplot of one clinical variable with raw values and after boxcox transfrom:

https://www.dropbox.com/s/vmgmv5gq0wspu6j/plot.density.pet.var2.png?dl=0 https://www.dropbox.com/s/and437c5q7ka82l/plot.density.pet.var3.png?dl=0 https://www.dropbox.com/s/fotcftthqgdzbtu/plot.density.pet.var4.png?dl=0 https://www.dropbox.com/s/zfgnbopczu28ahq/plot.density.pet.var5.png?dl=0 https://www.dropbox.com/s/ps7tro3her7jow1/plot.density.var1.png?dl=0 https://www.dropbox.com/s/76es3hatfx8l2h4/var8.inspect.png?dl=0 https://www.dropbox.com/s/9sk9hpd6bbaunaj/qqplot.suv.boxcox.transform.png?dl=0

B) Thus, to summarize in relation to the above part, after these exploratory plots, what is your opinion ? I could use lm(), but for instance standardize (center and scale) all my used continuous variables ? That is for both clinical & gene expression data ? Or generally due to the assumptions of lm, another model would be more valid to use ?

For example, in case i would use glm, like lm, how i could appropriately implement it:

as for instance:

dd1 <- lm(CD44~Clinical Var1 +Clinical Var2..+8, data=data1) # is for examining the "dependence" association of a specific genes to the clinical variables, but for a specific subset of the samples ? in the "sake of prediction inspection" ?

## whereas, for instance:

glm(stage ~ ClinicalVar1, family="binomial", link="logit") # see if there is any signficant association of the specific clinical variable to any of the group levels ?

Or i should formulate the glm formula above differently for my purpose discussed above ?

Please excuse me for my last question, but this distinction is a bit confusing, and need to be certain in order to move to further steps, such as odds ratio, etc

Best,

Efstathios-Iason

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Hello, based on that information (i.e. after having seen the distributions of the variables), I would only use a generalised linear model (glm()), and I would only this with the unlogged variables and the following parameter: family="Gamma"(link="log")

For example:

glm(stage ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"))

Then choose the best variables from the summary() function applied to the model and re-run the model.

If you wish to create a model with your clinical variables and a gene (e.g. CD44), then also use glm():

glm(CD44 ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"))

The only situation where you could really justify using the lm() with your gene is if the ClinicalVariable was categorised, such as 'LessThanZero' and 'GreaterThanZero'.

lm(CD44 ~ factor(ifelse(ClinicalVar1<0, "LessThanZero", "GreaterThanZero"), levels=c("LessThanZero", "GreaterThanZero"))

Remember that there is no correct way here! Someone will always find a way to criticise your work at every possible chance they can get.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello Kevin, thank you for your further clarifications-thus in order to finally summarize my thoughts, and being sure that i understood correctly your examples, specifically for the code implementation-Thus:

1) In any of the above cases, it is "safer" to use glm but with distinct purposes. In other words:

A) With glm(stage ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"),data=data):

with the above, i essentially use all my available dataset with the categorical variable stage, and inspect if there is any significant association with any of the two groups with any of the clinical variables, correct ? and could further move with odds ratio, etc

B) Whereas with:

data1 <- glm(CD44 ~ ClinicalVar1 + ClinicalVar2 + ... ClinicalVarN, family="Gamma"(link="log"),data=data1)

as also for the data argument, i should separate my data frame into 2 dataframes, as discussed above ? and also with data=data2, correct ? And this would show a putative association of any of the clinical vars to predict the value of a speficic gene, correct ? with interest in differences between the two groups ?

Finally, you last implementation with lm might be also performed, by example fitting a median threshold in the values of each variable, and perform the same approach.

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Hey Efstathios,

Yes, that is a 'yes' to all of your questions, including the dividing of the data into data1 (N0) and data2 (N1/N2) when using the CD44 model.

Just remember, too, that testing each clinical variable independently is different from testing them together. You can do both types and check the results.

For example:

glm(stage ~ ClinicalVar1, family="Gamma"(link="log"),data=data)

This asks: Is there an association between ClinicalVar1 and tumour stage?, i.e., can we use ClinicalVar1 to predict very well the tumour stage.

glm(stage ~ ClinicalVar1 + ClinicalVar2, family="Gamma"(link="log"), data=data)

Can the combination of Var1 and Var2 predict very well tumour stage? In this case, if ClinicalVar1 is highly significant on its own but ClinicalVar2 is not, then this model will still be poor. Each variable's information is 'adjusted' based on all other variables in the model. The quickest way to see which variables are good is by using the summary() function. To then check the overall power of the final model, which may have stage ~ Var1 + Var5 + Var7, you have to do cross-validation and ROC analysis, which I mentioned at the very beginning (and also derive the odds ratios).

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Kevin, thank you for your confirmation-i suppose the above example suits for the implementation of glm with one categorical variable and 1 or more continuous clinical, correct ?

whereas the implementation with both continuous groups, will just identify-if any-signficant associations with any of the vars with the gene right ?

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Yes, this: glm(CD44 ~ ClinicalVar1, family="Gamma"(link="log"), data=data1) is checking for an association between ClinicaVar1 and the expression levels of CD44 in the N0 group (Can we use ClinicalVar1 to predict the expression of the CD44 gene in N0 patients?). We use Gamma distribution due to the distribution of your ClinicalVar.

By the way, if you divide your dataset into N0 and N1/N2, you should also check this model: glm(NodesInvolved ~ CD44, data=data). In order to justify the division into N0 and N1/N2 for the CD44 model, you may have to first show that CD44 is statistically significantly different between N0 and N1/N2. This is likely a question from a statistician looking at your work in the future.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes exactly Kevin,

you got my point before answer it !! (also assume that with NodesInvolved above you meant the categorical variable for Lymph Node status that has the 2 levels discussed)--I could also perhaps implement statistical comparisons with limma and interaction models (etc),-but would be far more complicated for the design matrix, etc-. Thus, this approach as an alternative looks the same valid and more simple. Also, I don't know about any general issues that could be adjusted for confunders as with limma (but this could be another whole topic for discussion)-, but this is one approach straight to the point:

1) I could use your implemented loop for checking significant association between any of the gene signature and either tumor stage groups or lymph nodes:

for (i in 1:length(genelist))
{
   formula <- as.formula(paste("TumourStage ~ ", genelist[i], sep=""))

   model <- glm(formula, data=MyData, family=binomial)

   etc.
}

2) And then, for any significant resulted genes (with additional odds ratio and anova test provided), inspect subsequently any putative association and prediction from any of the pets, as discussed above:

For example:

glm(CD44 ~ ClinicalVar1, family="Gamma"(link="log"), data=data1)

My extra consern, is if i would have to do any multiple testing correction, but im not sure it applies here in the exact sense like microarrays or RNA-Seq.

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Hello, yes, that seems like a good plan (and, 'yes', use family=binomial in the first part with the loop). If you want to parallelise the loop to make it run quicker, you can take a look at my code here (see 'Any type of LM, GLM, clogit (here: clogit)'): R functions edited for parallel processing (this may be too complex for now, though).

Then, yes, you can check the significant genes in the other model ('2)').

You do not need to worry about limma here - I do not believe that limma is a good choice (because limma is fundamentally based on linear models). You neither need to worry about multiple-test corrections; however, as I said before, some person will always try to find a way to criticise - you just need to prepare for that.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Thanks for the additional links- i will now hopefully try to summarize and finally implement all the above exciting things-as for your last comment, of course two famous phrases suit here: "Essentially, all models are wrong, but some are useful", or even " among competing hypotheses, the one with the fewest assumptions should be selected"...

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Πολύ καλά! Να με ενημερώνετε για την πρόοδο

Very good! - keep me updated

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Just one more thing: you will also likely receive criticism for using the gamma distribution in the second model, but remember that it is due to the distribution of the clinical variables, which is clearly non-normal (not binomial).

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

again a very crussial comment:

i understand from a small search that this is a generally huge "debate" generally on distribution model selections in regression analysis and generally on statistical implementations, like gene expression. For example, i noticed many people talking about normality assumptions needed only in the error terms, and not the actual normal distributions of variables in linear models.

About the choise of gamma distribution--it is mainly due to the general non-normality of the clinical variables, as also for the skewness noticed in various distributions of these, right ? and binomial in the first place, as the genes are normalized correct ?

but again, in the second model with the gamma distribution, any significant association detected, for instance would still related to ie.:

an increase in one clinical variable, indicate also increase of the specific gene, right ? although not with the same sence of " direct linearity" with linear models ?

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

1

Entering edit mode

Yes, I think that gamma is better due to the skewness of your clinical variables. However, we are regressing these clinical variables against a normal distribution (the gene CD44).

I still believe that Gamma is the best choice here, but you will have to let the results guide you. If you wish, you can do further simple diagnostics by plotting a simple linear fit to visually test the assumption that the association is linear:

linearfit <- lm(CD44 ~ ClinicalVar1)
points(ClinicalVar1, CD44, pch=20)
abline(linearfit, col="red", lwd=2)

This will fit a linear line, but the relationship between x and y may look anything but linear, in which case we can support the choice of Gamma.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Probably,

i will have to test it for each of any significant genes identified in the first step, with each of the clinical variables. And hope the the majority of clinical variables follow the pattern we are discussing. The data will probably show.

ADD REPLY • link 6.5 years ago by svlachavas ▴ 790

score 3 · Answer 1 · 2017-10-19

Due to the fact that the outcome variables are categorical, you would have to use glm() and perform binomial/binary or multinomial logistic regression. You would essentially be testing the hypothesis that the expression of your genes of interest are related to the outcome(s) of interest.

First check that your categorical variables are ordered correctly

Convert your variables to factors with factor() and ensure that you choose the reference ('base') category. For example:

factor(MyData$TumourStage, levels=c("StageI", "StageII", "StageIII", "StageIV"))

Here, StageI will be assigned as the reference category automatically. You can also set the reference category specifically with:

relevel(MyData$TumourStage, ref="StageI"))

Test each gene independently

For example, you could initially test each gene independently.

glm(TumourStage ~ gene1, data=MyData, family=binomial)
glm(TumourStage ~ gene2, data=MyData, family=binomial)
...
glm(TumourStage ~ geneX, data=MyData, family=binomial)

You can then check the coefficients, derive odds ratios (ORs), and also a P-value by applying Chi-square ANOVA to the model:

mymodel <- glm(TumourStage ~ gene1, data=MyData, family=binomial)
summary(mymodel)
exp(cbind(OR=coef(mymodel), confint(mymodel))) #ORs
anova(mymodel, test="Chisq") #Chi-square ANOVA

Build predictive models using multiple genes

You could also build predictive models and choose the best predictors through step-wise regression:

null <- glm(TumourStage ~ 1, data=MyData, family=binomial)
full <- glm(TumourStage ~ gene1 + gene2 + ... + geneX, data=MyData, family=binomial)
require(MASS)
forward <- stepAIC(null, scope=list(lower=null, upper=full), direction="forward") #forward stepwise regression
backward <- stepAIC(full, direction="backward") #backward stepwise regression
both <- stepAIC(null, scope=list(lower=null, upper=full), direction="both") #forward + backward stepwise regression

Then, check each of these models to see which one is best. The interpretation of the word 'best', here, is very much open for discussion! One way to test each model is through cross-validation and ROC analysis

Cross-validation analysis

Here, 'K' is the number of chunks into which you will divide your dataset, and also the number of times that you will repeat the cross-validation. A value of 3 is common, but here I use 10-fold. K can also equal the total number of samples in the dataset, in which case the type of cross-validation is known as 'leave-one-out' cross-validation.

require(boot)
cv.glm(MyData, full, K=10)$delta
cv.glm(MyData, both, K=10)$delta
cv.glm(MyData, forward, K=10)$delta
cv.glm(MyData, backward, K=10)$delta

Then, plot deltae from the different models. The smallest difference and lowest values of deltae may represent the 'best' model

ROC analysis

ROC analysis with 95% confidence intervals
require(pROC)
roc <- roc(MyData$TumourStage, fitted(full), smooth=FALSE)
plot.roc(roc, grid=TRUE, grid.lwd=2)
text(0.5, 0.0, paste("AUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.8)

roc <- roc(MyData$TumourStage, fitted(both), smooth=TRUE)
plot.roc(roc, grid=TRUE, grid.lwd=2)
text(0.5, 0.0, paste("AUC (L95, AUC, U95):", paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "), cex=0.8)

et cetera.

Perform lasso-penalised, elastic-net, or ridge regression (if many predictors)

See my answers on these Biostars threads: