How to calculate the sensitivity and specificity of set of expression dataset used in a validation cohort
Entering edit mode
3.8 years ago
a.james ▴ 240

Hello All,

I have a set gene which I used on a validation cohort, to find its ability to specifically classify cancer subtype, now I need to calculate the specificity and sensitivity of these genes for classifying the cancer subtypes.

What would be the correct statistical approach/ tests to go for ward. I have ~1300 gene expression dataset from 58 leukemia samples (The validation cohort). The original cohort is 82 samples and same gene set. In both original and validation cohort on a unsupervised HC the geneset is concordant with classifying the molecular subtypes. now the question is how to calculate their sensitivity and specificity.

In my case I am wondering how to set the values for Sensitivity=True Positives (TP)/(TP+FN (False negative)), Specificity=TN (True Negative)/(FP(False Positive)+TN), in terms of TP, TN, FP, and FN.

Also, I would need more dataset to validate the specificity of these genes. These genes originally discovered and validated in Acute lymphoblastic leukemia samples. I would to have some suggestions on dataset based validations for this gene set and where to get more RNA-seq dataset, and what are the specific methods to validate or define the significance of these genes statistically.

gene RNA-Seq • 3.1k views
Entering edit mode
3.8 years ago

I presume that you know how to derive a final predictive model? 1,300 genes seems too large. I have posted various answers in relation to this, such as:

Once you decide on your final model, you can test sensitivity, specificity, precision, accuracy, and AUC (and other metrics) by applying your model to the same data on which your model was created and also new data, i.e., a validation cohort.


  • our regression model is called bayesmodel
  • our data (the data on which the model was built or the new data) is called modeling
  • our outcome / phenotype is stored in a variable called Disease (0 / 1) in modeling



make a model prediction

pred <- prediction(predict(bayesmodel, type="response"), modeling$Disease)

determine sensitivity and specificity

ss <- performance(pred, "sens", "spec")

print("Predicted probability cut-off:")
ss@alpha.values[[1]][which.max(ss@x.values[[1]] + ss@y.values[[1]])]

predicted.prob <- round(ss@alpha.values[[1]][which.max(ss@x.values[[1]] + ss@y.values[[1]])], 2)

print("Sensitivity and specificity, weighed equally, are:")
max(ss@x.values[[1]] + ss@y.values[[1]]) / 2

sens.spec <- round(max(ss@x.values[[1]] + ss@y.values[[1]]) / 2, 2)

determine cost function

cost.perf <- performance(pred, measure="cost")
cost <- pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]

determine precision

prec.perf <- performance(pred, measure="prec")
ind <- which.max(slot(prec.perf, "y.values")[[1]])
prec <- slot(prec.perf, "y.values")[[1]][ind]
prec.cutoff <- slot(prec.perf, "x.values")[[1]][ind]
print(c(precision=prec, cutoff=prec.cutoff))

determine accuracy

acc.perf <- performance(pred, measure="acc")
ind <- which.max(slot(acc.perf, "y.values")[[1]])
acc <- slot(acc.perf, "y.values")[[1]][ind]
acc.cutoff <- slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy=acc, cutoff=acc.cutoff))

plot ROC 'curve' and optimum accuracy

par(cex=0.8, mfrow=c(1,2))

roc <- roc(na.omit(modeling$Disease), predict(bayesmodel, type="response"), smooth=FALSE)
plot.roc(roc, add=FALSE, grid=TRUE, grid.lwd=2, col="red", main="")
text(0.25, 0.2, col="red",
    paste("Variant signature:\nAUC [L95-U95]:",
        paste(round(ci.auc(roc)[2], 3), " [",
            round(ci.auc(roc)[1], 3), "-",
            round(ci.auc(roc)[3], 3), "]", sep="", collapse=","), sep=" "),
    cex=1.1, font=2)
text(0.25, 0.05, col="red",
    paste("Sensitivity/Specificity: ", sens.spec*100, "%", sep=""), cex=1.1, font=2)

# accuracy
plot(acc.perf, main="Accuracy, variant signature", avg="vertical", spread.estimate="boxplot")
abline(v=acc.cutoff, lwd=1, lty=2, col="red")


Ignore the odd looking ROC curve for now. I`ve also edited out key info form these plots/ hence, the weird text labels.

Hope that it helps!


Entering edit mode

Thanks Kevin for the answer. The predictive model should be appiled on raw count or normlaized expression count? Also, in this case (RNA-seq) Bayseian predictive model would be a good choice?

Entering edit mode

You should build your model on the normalised, logged counts, which would hopefully follow a normal / Gaussian distribution. If you use the normalised, un-logged counts, then you`d have to run a negative binomial regression via glm.nb, as this would expect your data to be distributed as a negative binomial (a distribution that RNA-seq raw and normalised un-logged counts naturally follow). When you log these or scale them to Z-scores, they then more follow a Gaussian.

You do not have to use a Bayesian logistic regression model. I did in the case above for a very specific reason. You could easily just use glm() (generalised linear model) in R with family=binomial(link="logit") - this, them, assumes that your y variable is encoded as 0/1, 0/1/2, et cetera, and that your data is normally-distributed.

Entering edit mode

Thanks for the reply. In my case I have no pheotype/metadata information other than defined 3 subtypes and the age group (adults or pediatric patients information). And I tried lm() model on the dataset, using top 5 Up-regulated genes from 3 subtypes to predict the lm() model. Then used stepAIC for prediction score and later to use model prediction using prediction. I am not sure is this the right way I am doing, it would be great if you could walk me little more through this thank you. Additionally, I have modified my question alittle bit. The main part of question is: In both original and validation cohort the unsupervised HC on the differentially expressed (DE) 1300 genes showed a concordance in classifying the molecular subtypes of leukemia on both original and validation cohort. Now, the question is how to calculate the sensitivity and specificity of these 1300 genes?

Entering edit mode

I see, so, the 1300 genes segregate the sub-types via hierarchical cluster. What I would do after that is:

  1. penalised, cross-validated regression via cv.glmnet()
  2. take top 100 or 150 predictors from cv.glmnet() and reduce further via stepAIC()
  3. build different final models from stepAIC() output

With your final models, you would then test for:

  • r-squared shrinkage
  • sensitivity
  • specificity
  • AUC
  • et cetera

Hopefully, your final models will be fairly compact and have high prediction capability.

Alternatively, you could do something like RandomForest on the 1300 genes, or just go ahead and fit a weighted Bayesian multinomial logistic regression model (bayesglm()), where the y variable encodes the 3 sub-types. The x predictors would be your large list of genes. Standard glm will not work because there are too many variables.

There is never a fixed way of doing this, and I find that the methodology changes each time that one needs to build a model. In the approach that you describe above, you indicate that you just took 5 genes from each sub-type. Did you identify these 5 as the most differentially expressed?

Here is another article, which seems useful:

Entering edit mode

Thank you sir, for detailed description. I would follow those steps. Yes these five were most significantly up-regualted within each subtype


Login before adding your answer.

Traffic: 1266 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6