A simple regression in R
1
0
Entering edit mode
3.3 years ago
Learner ▴ 250

I am looking for a way to calculate AUC for a drug response . The data looks like the following

df<- structure(list(Compound = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), conc = c(0.00372,
0.01368, 0.05004, 0.1836, 0.6732, 2.472, 9.048, 33.24, 121.2,
446.4, 1632, 6000), Count = c(10200000, 10300000, 10200000, 10200000,
9927000, 9400000, 8464040, 6513840, 4349320, 3071480, 5989320,
8444560)), class = "data.frame", row.names = c(NA, -12L))


is there anyone who could help me

R • 2.2k views
4
Entering edit mode
3.3 years ago

df
Compound      conc    Count conc.pred
1         A 3.720e-03 10200000  565.1263
2         A 1.368e-02 10300000  559.3499
3         A 5.004e-02 10200000  565.1263
4         A 1.836e-01 10200000  565.1263
5         A 6.732e-01  9927000  580.8959
6         A 2.472e+00  9400000  611.3376
7         A 9.048e+00  8464040  665.4025
8         A 3.324e+01  6513840  778.0542
9         A 1.212e+02  4349320  903.0858
10        A 4.464e+02  3071480  976.8991
11        A 1.632e+03  5989320  808.3526
12        A 6.000e+03  8444560  666.5278


What are you aiming to model? You cannot do anything with the compound column because it only has one value.

The only thing that we can do here is build a linear regression model between conc and Count:

model <- lm(conc ~ Count, data = df)
summary(model)

Call:
lm(formula = conc ~ Count, data = df)

Residuals:
Min     1Q Median     3Q    Max
-781.9 -620.7 -565.1 -552.1 5333.5

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.154e+03  1.836e+03   0.629    0.544
Count       -5.776e-05  2.176e-04  -0.265    0.796

Residual standard error: 1816 on 10 degrees of freedom
Multiple R-squared:  0.006999,  Adjusted R-squared:  -0.0923
F-statistic: 0.07049 on 1 and 10 DF,  p-value: 0.796


It looks like there is no statistically significant relationship between these.

We can assess R^2 shrinkage using 3-fold cross-validation:

library(bootstrap)
theta.fit <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit$coef } x <- as.matrix(df[,which(colnames(df) %in% names(summary(model)$coefficients[,1]))])
y <- as.matrix(df[,c("conc")])
results <- crossval(x, y, theta.fit, theta.predict, ngroup = 3)

# Raw R^2
cor(y, model$fitted.values)**2 [,1] [1,] 0.006999459 # cross-validated R^2 cor(y, results$cv.fit)**2
[,1]
[1,] 0.07150013


It does not look like a good model, in fairness.

We can also plot the model-predicted conc values versus the real ones:

require(ggplot2)
df$conc.pred = predict(model) ggplot(df, aes(x = conc, y = conc.pred)) + geom_point() + geom_smooth(method='lm',formula=y~x) + stat_smooth(method="lm", fullrange=TRUE) + ggtitle("My linear model")  The confidence intervals (grey shade) are extraordinary. Finally, we can plot the ROC curve: require(pROC) roc <- roc(df$conc, fitted(model), smooth=FALSE)
plot.roc(
roc,
grid=TRUE,
grid.lwd=2,
col="royalblue",
main="My linear model")
text(0.3, 0.45,
col="royalblue",
paste("AUC (L95, AUC, U95):",
paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
cex=1.0)


Hopefully this can 'foment' some ideas for you.

If you have other compounds in your data and want to model them as the y variables in the model, then you can switch to using a generalised linear model (glm()) in place of the linear model (lm()).

Kevin

0
Entering edit mode

@Kevin Blighe Thanks Kevin, I thought that the concentration-response curves is based on the percent-viability scores which is often fit using smooth cubic splines for multivariate data . what I am mainly looking for is to calculate the AUC which should be only a value, right?

1
Entering edit mode

I wrote and re-wrote the response to your comment a few times and decided on this: you should have added much more information to your original question (or perhaps I should have requested it). In my very brief example that I give (above), if the model assumptions were correct, then you would have a single value for AUC.

You are talking about multivariate data, but what do you mean? You just have 3 columns in your data, and the Compound column is just all 'A' values.

0
Entering edit mode

@Kevin Blighe, So, let me give more information and see your comments, I will absolutely like and accept your answer even if it does not lead me to a correct answer. So, lets assume that I have a cell line which I treat with a compound in several concentration and I obtain the sensitivity values. If I have a replicate, should I treat them as a multivariate calibration or just do it over each replicate separately?

0
Entering edit mode

There is no right or wrong, I think. You can try them separate and then together (in a multivariate analysis). In the multivariate analysis, you could probably add cell-line as a covariate to see if there are cell-line specific effects in relation to drug sensitivity, too.

0
Entering edit mode

@Kevin Blighe could you please use the following data in your example? I am just wondering how you could do that

df<-structure(list(Drug_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "DrugT", class = "factor"),
Cell_line = structure(c(2L, 3L, 1L, 4L, 5L), .Label = c("Celline3",
"Cellline1", "Cellline2", "Cellline4", "Cellline5"), class = "factor"),
Concentrations_nM = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "3.1e-05,0.000114,0.000417,0.00153,0.00561,0.0206,0.0754,0.277,1.01,3.72", class = "factor"),
Responses = structure(c(4L, 2L, 3L, 5L, 1L), .Label = c("1,0.948118156866697,0.920303987764611,1.03610743904536,1.08332987533419,0.960086785898477,0.765642506120658,0.572520170014998,0.375835106792894,0.254180720963181",
"1,0.953703703703704,1,0.972222222222222,0.972222222222222,0.911407407407407,0.835722222222222,0.651211111111111,0.454325925925926,0.293437037037037",
"1,0.97679357161345,0.923286430268948,0.933051365914492,0.938592672768771,0.834813760323772,0.738458551965929,0.56534271394621,0.428992446881103,0.287483234899645",
"1,1.02970297029703,1.01980198019802,1.00990099009901,1.03960396039604,0.945219801980198,0.887081188118812,0.657615841584158,0.454776237623762,0.319762376237624",
"1,1.03893565587121,0.967905086838,1.06062943976775,0.886153353792697,0.897337425206165,0.719798149513376,0.553963001697327,0.372291848829301,0.301689897760328"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))

3
Entering edit mode

Hey, can you not simply repeat my same code with your new data?

0
Entering edit mode

@Kevin Blighe do you know the cell to cell interaction

0
Entering edit mode

@Kevin Blighe you wrote me this to see if there are cell-line specific effects in relation to drug sensitivity actually I have been reading a lot about this. no one uses lm for such a analysis. the worse one can do is liner because basically it is not linear but nonlinear and sigmoidal ! however, thanks for all your time