Calculating ROC50 in R
1
0
Entering edit mode
5 months ago
siu ▴ 160

Hi everyone,

I have a question regarding the ROC50 calculation for protein remote homology detection.

I have gone through different papers

"Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching"
"https://www.nature.com/articles/srep32333#Sec6"
"https://www.pnas.org/doi/10.1073/pnas.0308067101#sec-1"
"https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1842-2"

I have done homology search of my queries and I have the table (almost 50000 hits)

Query   Query_Fam   Target  Target_Fam  eValue  bitScore
d1i50a  PF04997 d1i50a  PF04997 0   2889
d1i50b  PF04565 d1i50b  PF04565 0   2420
d1i6vd  PF00623 d1i6vd  PF00623 0   2327
d1i6vc  PF04563 d1i6vc  PF04563 0   2194
d1htya  PF09261 d1htya  PF09261 0   2098
d1eula  PF00689 d1eula  PF00689 0   1974
d1qbkb  PF03810 d1qbkb  PF03810 0   1801
d1ygpa  PF00343 d1ygpa  PF00343 0   1774
d1ceza  PF14700 d1ceza  PF14700 0   1774
d1fiy   PF00311 d1fiy   PF00311 0   1749
d1qgra  PF13513 d1qgra  PF13513 0   1730
d2btva  PF01700 d2btva  PF01700 0   1693
d1a8i   PF00343 d1a8i   PF00343 0   1683
d1em6a  PF00343 d1em6a  PF00343 0   1637
d1qm5a  PF00343 d1qm5a  PF00343 0   1623
d2mysa2 PF00063 d2mysa2 PF00063 0   1603
g1gk9.1 PF01804 g1gk9.1 PF01804 0   1585
d1b7ta4 PF00063 d1b7ta4 PF00063 0   1584

The true positives and False positives labels will be based on the protein families to which the Query and Target proteins belong to.

I am not getting that how they are plotting the "Proportion of protein with given performance vs ROC50 values". Please let me know, If anyone aware of this problem and how to do it in R .

Thank you so much

ROC Protein R Homology • 713 views
ADD COMMENT
0
Entering edit mode

I think you'd first need to know which are true positives and false positives

ADD REPLY
0
Entering edit mode

I will label a protein (Target) as a true positive if it belongs to the same protein family (Query), as indicated in the 'Query Fam' and 'Target Fam' columns. If the proteins do not belong to the same family, I will label them as false positives."

ADD REPLY
0
Entering edit mode
5 months ago
tab <- read.table(text = "Query   Query_Fam   Target  Target_Fam  eValue  bitScore  hit
d1i50a  PF04997 d1i50a  PF04997 0   2889  TRUE
d1i50b  PF04565 d1i50b  PF04565 0   2420  TRUE
d1i6vd  PF00623 d1i6vd  PF00623 0   2327  TRUE
d1i6vc  PF04563 d1i6vc  PF04563 0   2194  TRUE
d1htya  PF09261 d1htya  PF09261 0   2098  FALSE
d1eula  PF00689 d1eula  PF00689 0   1974  TRUE
d1qbkb  PF03810 d1qbkb  PF03810 0   1801  TRUE
d1ygpa  PF00343 d1ygpa  PF00343 0   1774  TRUE
d1ceza  PF14700 d1ceza  PF14700 0   1774  FALSE
d1fiy   PF00311 d1fiy   PF00311 0   1749  TRUE
d1qgra  PF13513 d1qgra  PF13513 0   1730  FALSE
d2btva  PF01700 d2btva  PF01700 0   1693  TRUE
d1a8i   PF00343 d1a8i   PF00343 0   1683  TRUE
d1em6a  PF00343 d1em6a  PF00343 0   1637  TRUE
d1qm5a  PF00343 d1qm5a  PF00343 0   1623  TRUE
d2mysa2 PF00063 d2mysa2 PF00063 0   1603  TRUE
g1gk9.1 PF01804 g1gk9.1 PF01804 0   1585  TRUE
d1b7ta4 PF00063 d1b7ta4 PF00063 0   1584  FALSE",header=TRUE,stringsAsFactors=FALSE)

# draw an ROC curve
library(ROCR)
pred <- prediction(tab$bitScore, tab$hit)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve", colorize = TRUE)

# calculate the AUC
auc <- performance(pred, "auc")

#total number of FALSE
total_false <- length(tab$hit[tab$hit==FALSE])

#fpr to get to 50 FP's
fpr.stop <- min(50/total_false,1)

#what is the AUC at the 50th FALSE
AUC50 <- performance(pred, "auc", fpr.stop=fpr.stop)
ADD COMMENT
0
Entering edit mode

Thank you so much for this. I want to draw this kind of plot for my sequences at different ROC50 scores

proportion of sequences vs ROC50 scores

I am not getting how will I get different ROC50 scores.

ADD REPLY
0
Entering edit mode

that plot would not make sense for an individual sequences to generate ROC50's. it would make sense for a "class" of sequences. maybe search the literature for some additional examples. i wonder what is going on with Hmmer.

ADD REPLY
0
Entering edit mode

As far as I understood, they are plotting the proportion of sequences upto they found 50 false positives, like when they found ist false positive they are having the ROC50 score at that point and they goes upto 50 like this.

ADD REPLY

Login before adding your answer.

Traffic: 1475 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6