Hi everyone,
I have a question regarding the ROC50 calculation for protein remote homology detection.
I have gone through several papers:
"Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching"
"https://www.nature.com/articles/srep32333#Sec6"
"https://www.pnas.org/doi/10.1073/pnas.0308067101#sec-1"
"https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1842-2"
I have run a homology search with my queries and have the following table of hits (almost 50,000 rows):
Query    Query_Fam  Target   Target_Fam  eValue  bitScore
d1i50a   PF04997    d1i50a   PF04997     0       2889
d1i50b   PF04565    d1i50b   PF04565     0       2420
d1i6vd   PF00623    d1i6vd   PF00623     0       2327
d1i6vc   PF04563    d1i6vc   PF04563     0       2194
d1htya   PF09261    d1htya   PF09261     0       2098
d1eula   PF00689    d1eula   PF00689     0       1974
d1qbkb   PF03810    d1qbkb   PF03810     0       1801
d1ygpa   PF00343    d1ygpa   PF00343     0       1774
d1ceza   PF14700    d1ceza   PF14700     0       1774
d1fiy    PF00311    d1fiy    PF00311     0       1749
d1qgra   PF13513    d1qgra   PF13513     0       1730
d2btva   PF01700    d2btva   PF01700     0       1693
d1a8i    PF00343    d1a8i    PF00343     0       1683
d1em6a   PF00343    d1em6a   PF00343     0       1637
d1qm5a   PF00343    d1qm5a   PF00343     0       1623
d2mysa2  PF00063    d2mysa2  PF00063     0       1603
g1gk9.1  PF01804    g1gk9.1  PF01804     0       1585
d1b7ta4  PF00063    d1b7ta4  PF00063     0       1584
The true positive and false positive labels will be based on the protein families to which the Query and Target proteins belong.
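To make the labeling rule concrete, here is a minimal sketch in Python (the comparison itself carries over to R unchanged). The function name `label_hits` is hypothetical, and the second example row is a made-up cross-family hit, not a row from my results; it is only there to show an FP.

```python
def label_hits(lines):
    """Parse whitespace-separated hit lines and attach a TP/FP label.

    Each line is expected to hold six fields in the order:
    Query, Query_Fam, Target, Target_Fam, eValue, bitScore.
    """
    labeled = []
    for line in lines:
        query, query_fam, target, target_fam, evalue, bit_score = line.split()
        # Same family means true positive, otherwise false positive.
        label = "TP" if query_fam == target_fam else "FP"
        labeled.append((query, target, float(bit_score), label))
    return labeled


rows = [
    "d1i50a  PF04997 d1i50a  PF04997 0   2889",    # row from the table above
    "d1i50a  PF04997 d1i50b  PF04565 1e-3 40",     # made-up cross-family hit
]

for record in label_hits(rows):
    print(record)
```

Note that the rows shown in the table are all self-hits (Query equals Target), so they are trivially TPs; whether self-hits should count at all is a separate decision.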
I do not understand how they plot "Proportion of proteins with given performance" against ROC50 values. Please let me know if anyone is familiar with this problem and knows how to do it in R.
Thank you so much
I think you'd first need to know which are true positives and false positives
I will label a Target protein as a true positive if it belongs to the same protein family as the Query, as indicated in the 'Query_Fam' and 'Target_Fam' columns. If the two proteins do not belong to the same family, I will label the hit as a false positive.
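With those labels, ROC50 for a single query is usually defined as the normalized area under the ROC curve up to the 50th false positive: ROC_n = (1 / (n * T)) * sum of t_i for i = 1..n, where t_i is the number of true positives ranked above the i-th false positive and T is the total number of true positives for that query. As far as I can tell, the plot in those papers is then a cumulative curve: for each ROC50 threshold on the x-axis, the y-axis gives the fraction of queries whose ROC50 is at least that threshold. Below is a rough Python sketch of both steps; the logic is language-independent and should port to R directly. All function names are hypothetical, and the assumptions are mine: hits are ranked by bit score, self-hits are dropped, T is taken from the hit list unless supplied, and any hits missing from the list are assumed to rank below the reported ones.

```python
from collections import defaultdict


def roc_n(labels, n=50, total_pos=None):
    """ROC_n for one query. labels: booleans, best hit first, True = same family."""
    if total_pos is None:
        # Assumption: every true family member shows up in the hit list.
        total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # true positives ranked above this false positive
            if fp_seen == n:
                break
    # Assumption: if fewer than n false positives were reported, the missing
    # ones rank below everything in the list.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_pos)


def roc_per_query(hits, n=50):
    """hits: iterable of (query, query_fam, target, target_fam, bit_score)."""
    by_query = defaultdict(list)
    for query, query_fam, target, target_fam, bit_score in hits:
        if query == target:
            continue  # assumption: self-hits are not counted
        by_query[query].append((bit_score, query_fam == target_fam))
    scores = {}
    for query, rows in by_query.items():
        rows.sort(key=lambda r: r[0], reverse=True)  # higher bit score ranks first
        scores[query] = roc_n([is_tp for _, is_tp in rows], n)
    return scores


def fraction_at_or_above(roc_scores, thresholds):
    """For each threshold, the fraction of queries with ROC score >= threshold."""
    values = list(roc_scores)
    return [sum(v >= t for v in values) / len(values) for t in thresholds]
```

To draw the curve, call `roc_per_query` on the full hit table, pass the resulting scores with a grid of thresholds from 0 to 1 to `fraction_at_or_above`, and plot thresholds on the x-axis against the returned fractions on the y-axis. If the true family sizes are known from the database, passing them as `total_pos` is safer than counting positives in the hit list, since members that were never reported would otherwise inflate the score.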