Question

How is a gold standard protein interaction defined, and how is a gold standard dataset constructed?

2.5 years ago

bhodai • 0

To my knowledge, protein interactions identified in experiments are usually compared against a reference dataset that is considered the gold standard, and that dataset is assumed to contain true positive and true negative interactions. But how is such a gold standard dataset created in the first place?

interaction protein • 1.3k views

updated 2.5 years ago by Mensur Dlakic ★ 27k • written 2.5 years ago by bhodai • 0

score 2 · Answer 1 · 2021-10-10

2.5 years ago

Mensur Dlakic ★ 27k

As I think is done in all areas of science, gold standard datasets are created by careful and repeated wet lab experiments on a small scale. In this case, that means collecting literature data where various labs have demonstrated interactions in a low-throughput fashion using two-hybrid assays, pulldowns, co-immunoprecipitations, genetic interactions, TAP-tagging, fluorescence co-localization, etc. When a given interaction is confirmed by multiple labs, using several different methods, and possibly in different organisms, it gets gold standard status.
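As a rough illustration of the criterion above, the sketch below keeps only those pairs that are supported by several distinct methods and several distinct publications. The record layout, the protein and publication identifiers, and the thresholds (`min_methods`, `min_publications`) are all hypothetical and chosen for illustration; published gold standards apply their own curation rules.

```python
from collections import defaultdict

# Hypothetical evidence records: (protein_a, protein_b, method, publication_id).
# All identifiers are placeholders, not real proteins or papers.
EVIDENCE = [
    ("P1", "P2", "two-hybrid", "pub1"),
    ("P2", "P1", "co-immunoprecipitation", "pub2"),
    ("P1", "P3", "two-hybrid", "pub1"),
    ("P3", "P4", "pulldown", "pub3"),
    ("P3", "P4", "pulldown", "pub4"),
]


def high_confidence_pairs(evidence, min_methods=2, min_publications=2):
    """Return pairs backed by enough distinct methods and publications.

    The default thresholds are assumptions made for this sketch.
    """
    methods = defaultdict(set)
    publications = defaultdict(set)
    for a, b, method, pub in evidence:
        pair = tuple(sorted((a, b)))  # treat A-B and B-A as the same pair
        methods[pair].add(method)
        publications[pair].add(pub)
    return {
        pair
        for pair in methods
        if len(methods[pair]) >= min_methods
        and len(publications[pair]) >= min_publications
    }


gold = high_confidence_pairs(EVIDENCE)
print(sorted(gold))
```

With these toy records only the P1-P2 pair qualifies: P3-P4 was reported twice but by a single method, and P1-P3 was reported only once.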

Hi Dr. Dlakic, thank you for answering. I was thinking more about how the term 'gold standard dataset' is rigorously defined. I am currently using machine learning algorithms to predict de novo interactions from PPI data, and in every subfield of machine learning there is an established set of rules for arriving at a gold standard (e.g., in natural language processing, the British National Corpus is considered a gold standard, and a protocol was followed to create it). Since the performance of a machine learning algorithm depends heavily on the quality of the data, I was trying to create the gold standard dataset for my project from scratch. That made me wonder what the usual procedure is for creating such datasets for PPI data. I have been looking for relevant literature and found this paper from 2010: https://www.researchgate.net/publication/220173162_From_Experimental_Approaches_to_Computational_Techniques_A_Review_on_the_Prediction_of_Protein-Protein_Interactions . But I am still not sure how the process works.

Reply • 2.5 years ago by bhodai • 0

My answer is the same after reading your added explanation.

I suggest you find a dataset that has already been used in quality publications and test your own methodology on it. You are not the first to predict PPIs, and it would help others if your method could be compared with existing ones on the same dataset. If you come up with your own gold standard, it will be debatable whether your performance is truly an improvement or simply a side effect of a biased dataset you created.
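To make such a comparison concrete, here is a minimal sketch of scoring a set of predicted pairs against an existing gold standard made of positive and negative pairs. The function name `evaluate`, the single-letter protein labels, and the choice to ignore predictions that fall outside the gold standard are assumptions made for this example, not a prescribed protocol.

```python
def evaluate(predicted, gold_positives, gold_negatives):
    """Score predicted pairs against gold standard positives and negatives.

    Predicted pairs that appear in neither gold set are ignored, since the
    gold standard says nothing about them (an assumption of this sketch).
    """
    def normalize(pairs):
        # Treat A-B and B-A as the same interaction.
        return {tuple(sorted(p)) for p in pairs}

    predicted = normalize(predicted)
    positives = normalize(gold_positives)
    negatives = normalize(gold_negatives)

    tp = len(predicted & positives)   # gold positives that were predicted
    fp = len(predicted & negatives)   # gold negatives that were predicted
    fn = len(positives - predicted)   # gold positives that were missed
    tn = len(negatives - predicted)   # gold negatives correctly left out

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Hypothetical toy data; the labels are placeholders, not real proteins.
gold_pos = [("A", "B"), ("C", "D"), ("E", "F")]
gold_neg = [("A", "D"), ("B", "E")]
predictions = [("B", "A"), ("C", "D"), ("A", "D"), ("X", "Y")]

result = evaluate(predictions, gold_pos, gold_neg)
print(result)
```

Running the same scoring on the same published dataset for every method is what makes the resulting numbers comparable.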

Reply • 2.5 years ago by Mensur Dlakic ★ 27k

So you are suggesting that there is no fixed set of rules followed by all manually curated gold standard datasets, and that it is a process of trial and error?

Reply • 2.5 years ago by bhodai • 0

I am suggesting no such thing. Please read carefully what I wrote, and also read through the papers that describe how gold standard datasets were created.

https://pubmed.ncbi.nlm.nih.gov/14564010/

Reply • 2.5 years ago by Mensur Dlakic ★ 27k