Question

I need help with univariate logistic regression in a set of microarray data

6 months ago

Pooria • 0

Hello dear friends.

I'm kind of stuck on some analyses.

I have expression values of a gene in 97 samples. About half of these samples are from healthy individuals, and the rest are from patients.

Now I want to perform a univariate logistic regression and predict disease occurrence from the gene expression level. I used the glm() function as below, but I got a very very huge odds ratio and CI. My data has no NA values, and I don't think there are any outliers, because the values are all in the same range (between 8 and 10, give or take a little). The p-value is also highly significant. I don't know what the problem is; I have searched for hours but could not fix it. I'd appreciate it if you could share your answers.

### FRGmetadata is a data frame whose columns are the different variables and whose rows are the samples (GSM numbers).
model <- glm(formula = status~TXN, data = FRGmetadata, family = binomial)

Sincerely,

R microarray odds-ratio • 466 views


Ram · Answer 1 · 2023-10-03

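Logistic regression coefficients depend on the scale of your predictor variable. Since your gene expression values all fall within a narrow range (between 8 and 10), a one-unit change in expression spans about half of that range, so the per-unit coefficient and the odds ratio derived from it can be very large.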

One way to overcome this is by scaling your predictor variable.

This can be done using the scale() function in R, which will center and scale your data to have a mean of 0 and a standard deviation of 1.

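This can help the numerical stability of the logistic regression model and, more importantly, expresses the coefficient per standard deviation of expression rather than per expression unit. For a plain glm() fit it does not change the fitted probabilities or the p-value.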

# scale() returns a one-column matrix, so convert it to a plain numeric vector
FRGmetadata$TXN_scaled <- as.numeric(scale(FRGmetadata$TXN))
model <- glm(formula = status ~ TXN_scaled, data = FRGmetadata, family = binomial)
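
To see what standardizing actually changes, here is a minimal sketch on simulated data, not the poster's FRGmetadata (the data frame `sim`, the seed, and the effect size are all made up for illustration). For an ordinary glm() fit, the fitted probabilities should stay the same and only the unit of the coefficient should change, from "per one expression unit" to "per one standard deviation":

```r
set.seed(1)
n <- 97
# Hypothetical expression values in the 8-10 range described in the question
sim <- data.frame(TXN = runif(n, min = 8, max = 10))
# Simulated disease status with an assumed dependence on expression
sim$status <- rbinom(n, size = 1, prob = plogis(3 * (sim$TXN - 9)))
sim$TXN_scaled <- as.numeric(scale(sim$TXN))

fit_raw    <- glm(status ~ TXN,        data = sim, family = binomial)
fit_scaled <- glm(status ~ TXN_scaled, data = sim, family = binomial)

# Compare fitted probabilities between the two parameterizations
same_fit <- all.equal(unname(fitted(fit_raw)), unname(fitted(fit_scaled)),
                      tolerance = 1e-6)
same_fit

# The scaled slope is the raw slope multiplied by the standard deviation of TXN
slope_raw_times_sd <- unname(coef(fit_raw)["TXN"]) * sd(sim$TXN)
slope_scaled       <- unname(coef(fit_scaled)["TXN_scaled"])
c(slope_raw_times_sd, slope_scaled)

# Odds ratio per one expression unit vs per one standard deviation
exp(coef(fit_raw)["TXN"])
exp(coef(fit_scaled)["TXN_scaled"])
```

The per-unit odds ratio is the larger of the two here simply because one expression unit is a bigger step than one standard deviation in this simulated range.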

You can also check the sample size and, once you add more predictors, multicollinearity.

Note: Ensure that the biological context supports the use of logistic regression for predicting disease occurrence based on this gene expression data.

Sometimes, logistic regression might not be the most appropriate method if the relationship between the gene expression and disease is not inherently logistic in nature.

score 0 · Answer 2 · 2023-10-03

"got a very very huge odds ratio and CI"

You have a coefficient, not an odds ratio between two groups. The familiar odds ratio comes from a contingency table (disease/healthy vs. groupA/groupB). What you have is a continuous variable, and the coefficient in the regression is relative to the scale of the variable: exponentiating it gives the odds ratio per one-unit increase in expression, which cannot be interpreted as a group-versus-group odds ratio. That the CI is also large indicates that there's probably high variance within groups.

This is not an issue with the glm function, just with the interpretation of logistic regression for a continuous predictor.
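
As a rough illustration of how the step size drives the number: for a continuous predictor with coefficient beta, exp(beta * d) is the odds ratio for an increase of d units. The sketch below uses simulated data, not the poster's (the data frame `sim`, the seed, and the 0.1-unit step are assumptions for illustration), with Wald intervals from confint.default():

```r
set.seed(2)
n <- 97
# Hypothetical expression values in the 8-10 range described in the question
sim <- data.frame(TXN = runif(n, min = 8, max = 10))
sim$status <- rbinom(n, size = 1, prob = plogis(3 * (sim$TXN - 9)))

fit <- glm(status ~ TXN, data = sim, family = binomial)

beta <- unname(coef(fit)["TXN"])
ci   <- confint.default(fit)["TXN", ]  # Wald 95% CI on the log-odds scale

# Odds ratio and CI per one-unit increase (about half the observed range)
or_per_unit <- exp(c(estimate = beta, ci))

# Odds ratio and CI per 0.1-unit increase, a step closer to the data's spread
or_per_tenth <- exp(0.1 * c(estimate = beta, ci))

or_per_unit
or_per_tenth
```

Both lines describe the same fitted model; only the size of the step being compared differs.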

A typical approach to construct an "odds ratio" in this case would be to repeat the logistic regression as follows:

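# Quantile rank of each sample's TXN expression (rank() is unaffected by scaling)
FRGmetadata$TXN_quantile <- rank(FRGmetadata$TXN) / nrow(FRGmetadata)
# Bottom 20% = low, top 20% = high, everything else = intermediate
FRGmetadata$TXN_group <- ifelse(FRGmetadata$TXN_quantile <= 0.2, 'low_TXN',
                         ifelse(FRGmetadata$TXN_quantile >= 0.8, 'high_TXN',
                                'intermediate_TXN'))
# Keep only the extreme groups, with low_TXN as the reference level
extremes <- subset(FRGmetadata, TXN_group != 'intermediate_TXN')
extremes$TXN_group <- factor(extremes$TXN_group, levels = c('low_TXN', 'high_TXN'))
model_group <- glm(formula = status ~ TXN_group, data = extremes, family = binomial)
exp(coef(model_group))  # second entry: odds ratio, high_TXN vs low_TXN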

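After exponentiating the group coefficient, this would give you the odds ratio of the high-expression group vs. the low-expression group. Check which group glm() treats as the reference level, since that determines the direction of the ratio.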
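
As a cross-check, with a two-level group the exponentiated glm() coefficient should equal the cross-product odds ratio from the 2x2 contingency table. A sketch on simulated data (the data frame `sim`, the seed, the group sizes, and the disease probabilities are hypothetical):

```r
set.seed(3)
sim <- data.frame(
  TXN_group = factor(rep(c("low_TXN", "high_TXN"), each = 20),
                     levels = c("low_TXN", "high_TXN"))
)
# Assumed disease probabilities: 30% in the low group, 70% in the high group
sim$status <- rbinom(40, size = 1,
                     prob = ifelse(sim$TXN_group == "high_TXN", 0.7, 0.3))

# Odds ratio from the logistic regression (high_TXN vs low_TXN)
fit <- glm(status ~ TXN_group, data = sim, family = binomial)
or_glm <- unname(exp(coef(fit)["TXN_grouphigh_TXN"]))

# Odds ratio from the 2x2 table: (diseased high * healthy low) /
#                                (healthy high * diseased low)
tab <- table(sim$TXN_group, sim$status)
or_table <- (tab["high_TXN", "1"] * tab["low_TXN", "0"]) /
            (tab["high_TXN", "0"] * tab["low_TXN", "1"])

all.equal(or_glm, or_table, tolerance = 1e-6)
```

Setting the factor levels explicitly makes low_TXN the reference, so the coefficient reads as high vs. low rather than the reverse.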