Question: A simple regression in R
0
10 months ago by
Learner 200
Learner 200 wrote:

I am looking for a way to calculate AUC for a drug response . The data looks like the following

``````df<- structure(list(Compound = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), conc = c(0.00372,
0.01368, 0.05004, 0.1836, 0.6732, 2.472, 9.048, 33.24, 121.2,
446.4, 1632, 6000), Count = c(10200000, 10300000, 10200000, 10200000,
9927000, 9400000, 8464040, 6513840, 4349320, 3071480, 5989320,
8444560)), class = "data.frame", row.names = c(NA, -12L))
``````

is there anyone who could help me

R • 521 views
modified 10 months ago • written 10 months ago by Learner 200
4
10 months ago by
Kevin Blighe54k
Kevin Blighe54k wrote:

``````df
Compound      conc    Count conc.pred
1         A 3.720e-03 10200000  565.1263
2         A 1.368e-02 10300000  559.3499
3         A 5.004e-02 10200000  565.1263
4         A 1.836e-01 10200000  565.1263
5         A 6.732e-01  9927000  580.8959
6         A 2.472e+00  9400000  611.3376
7         A 9.048e+00  8464040  665.4025
8         A 3.324e+01  6513840  778.0542
9         A 1.212e+02  4349320  903.0858
10        A 4.464e+02  3071480  976.8991
11        A 1.632e+03  5989320  808.3526
12        A 6.000e+03  8444560  666.5278
``````

What are you aiming to model? You cannot do anything with the compound column because it only has one value.

The only thing that we can do here is build a linear regression model between `conc` and `Count`:

``````model <- lm(conc ~ Count, data = df)
summary(model)

Call:
lm(formula = conc ~ Count, data = df)

Residuals:
Min     1Q Median     3Q    Max
-781.9 -620.7 -565.1 -552.1 5333.5

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.154e+03  1.836e+03   0.629    0.544
Count       -5.776e-05  2.176e-04  -0.265    0.796

Residual standard error: 1816 on 10 degrees of freedom
Multiple R-squared:  0.006999,  Adjusted R-squared:  -0.0923
F-statistic: 0.07049 on 1 and 10 DF,  p-value: 0.796
``````

It looks like there is no statistically significant relationship between these.

We can assess R^2 shrinkage using 3-fold cross-validation:

``````library(bootstrap)
theta.fit <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit\$coef }
x <- as.matrix(df[,which(colnames(df) %in% names(summary(model)\$coefficients[,1]))])
y <- as.matrix(df[,c("conc")])
results <- crossval(x, y, theta.fit, theta.predict, ngroup = 3)

# Raw R^2
cor(y, model\$fitted.values)**2
[,1]
[1,] 0.006999459

# cross-validated R^2
cor(y, results\$cv.fit)**2
[,1]
[1,] 0.07150013
``````

It does not look like a good model, in fairness.

We can also plot the model-predicted `conc` values versus the real ones:

``````require(ggplot2)
df\$conc.pred = predict(model)
ggplot(df, aes(x = conc, y = conc.pred)) +
geom_point() +
geom_smooth(method='lm',formula=y~x) +
stat_smooth(method="lm", fullrange=TRUE) +
ggtitle("My linear model")
``````

The confidence intervals (grey shade) are extraordinary.

Finally, we can plot the ROC curve:

``````require(pROC)
roc <- roc(df\$conc, fitted(model), smooth=FALSE)
plot.roc(
roc,
grid=TRUE,
grid.lwd=2,
col="royalblue",
main="My linear model")
text(0.3, 0.45,
col="royalblue",
paste("AUC (L95, AUC, U95):",
paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
cex=1.0)
``````

Hopefully this can 'foment' some ideas for you.

If you have other compounds in your data and want to model them as the y variables in the model, then you can switch to using a generalised linear model (`glm()`) in place of the linear model (`lm()`).

Kevin

@Kevin Blighe Thanks Kevin, I thought that the concentration-response curves is based on the percent-viability scores which is often fit using smooth cubic splines for multivariate data . what I am mainly looking for is to calculate the AUC which should be only a value, right?

1

I wrote and re-wrote the response to your comment a few times and decided on this: you should have added much more information to your original question (or perhaps I should have requested it). In my very brief example that I give (above), if the model assumptions were correct, then you would have a single value for AUC.

You are talking about multivariate data, but what do you mean? You just have 3 columns in your data, and the `Compound` column is just all 'A' values.

@Kevin Blighe, So, let me give more information and see your comments, I will absolutely like and accept your answer even if it does not lead me to a correct answer. So, lets assume that I have a cell line which I treat with a compound in several concentration and I obtain the sensitivity values. If I have a replicate, should I treat them as a multivariate calibration or just do it over each replicate separately?

There is no right or wrong, I think. You can try them separate and then together (in a multivariate analysis). In the multivariate analysis, you could probably add `cell-line` as a covariate to see if there are cell-line specific effects in relation to drug sensitivity, too.

@Kevin Blighe could you please use the following data in your example? I am just wondering how you could do that

``````df<-structure(list(Drug_name = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "DrugT", class = "factor"),
Cell_line = structure(c(2L, 3L, 1L, 4L, 5L), .Label = c("Celline3",
"Cellline1", "Cellline2", "Cellline4", "Cellline5"), class = "factor"),
Concentrations_nM = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "3.1e-05,0.000114,0.000417,0.00153,0.00561,0.0206,0.0754,0.277,1.01,3.72", class = "factor"),
Responses = structure(c(4L, 2L, 3L, 5L, 1L), .Label = c("1,0.948118156866697,0.920303987764611,1.03610743904536,1.08332987533419,0.960086785898477,0.765642506120658,0.572520170014998,0.375835106792894,0.254180720963181",
"1,0.953703703703704,1,0.972222222222222,0.972222222222222,0.911407407407407,0.835722222222222,0.651211111111111,0.454325925925926,0.293437037037037",
"1,0.97679357161345,0.923286430268948,0.933051365914492,0.938592672768771,0.834813760323772,0.738458551965929,0.56534271394621,0.428992446881103,0.287483234899645",
"1,1.02970297029703,1.01980198019802,1.00990099009901,1.03960396039604,0.945219801980198,0.887081188118812,0.657615841584158,0.454776237623762,0.319762376237624",
"1,1.03893565587121,0.967905086838,1.06062943976775,0.886153353792697,0.897337425206165,0.719798149513376,0.553963001697327,0.372291848829301,0.301689897760328"
), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
``````
ADD REPLYlink modified 10 months ago by Kevin Blighe54k • written 10 months ago by Learner 200
3

Hey, can you not simply repeat my same code with your new data?

@Kevin Blighe do you know the cell to cell interaction

@Kevin Blighe you wrote me this `to see if there are cell-line specific effects in relation to drug sensitivity` actually I have been reading a lot about this. no one uses lm for such a analysis. the worse one can do is liner because basically it is not linear but nonlinear and sigmoidal ! however, thanks for all your time

no one uses lm for such a analysis. the worse one can do is liner because basically it is not linear but nonlinear and sigmoidal !

Only a naïve person would rigidly adhere to this statement ^