For virtual screening, how can I find bioactivity data for protein targets that have none available?
3.0 years ago
entropy ▴ 50


Using ChEMBL bioactivity data (mainly SMILES), I trained a deep learning model for hit detection against my target proteins via virtual screening. However, ChEMBL does not have compound bioactivity data for all of these proteins. Some of my proteins of interest are not available in either ChEMBL or PubChem.

I need some bioactivity data for those proteins to fine-tune my DL model. Is there a way I can still use the model to predict hits for those proteins? Essentially, can you suggest how I can get alternative bioactivity data for them? I was thinking of using bioactivity data from proteins similar to my target proteins of interest. If this is the right approach, can you suggest tutorials that demonstrate protein similarity search, programmatically if possible?


compound protein drug target hit detection ML • 874 views
3.0 years ago
matteoferla ▴ 30

Okay, I'll split my answer in two, as you are asking two things. (And I am assuming that by "all the proteins" you mean all the proteins you care about, not, say, the whole human proteome.)

Find homologues

To find homologues of a protein, BLAST is the best tool. It has a web interface, but there is also a command-line version to run locally, assuming you download all the protein sequences (a lot), and a RESTful API, which has many wrappers, including one in Biopython. Proteins diverge most on the surface, so similarity above 70% is still okay for your hits, but lower might be iffy. A short match with lower than 20% similarity is likely junk. Proteins are divided into families; Pfam IDs are a great resource for this and appear in the UniProt entry for a protein, UniProt being one of the best general-purpose protein databases.
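As a rough illustration of the filtering step, here is a minimal sketch that reads BLAST+ tabular output (`-outfmt 6`, whose default columns are listed in the code) and keeps only long, high-identity hits. The 70% identity cut-off comes from the paragraph above; the 80% query-coverage cut-off, the function name, and the example rows are assumptions made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_blast_hits(tabular_text, query_length,
                      min_identity=70.0, min_coverage=0.8):
    """Return (subject_id, percent_identity, query_coverage) for hits
    that pass both cut-offs. min_coverage is an assumed threshold."""
    hits = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(BLAST_COLUMNS, row))
        identity = float(record["pident"])
        # Fraction of the query sequence covered by this alignment
        aligned = int(record["qend"]) - int(record["qstart"]) + 1
        coverage = aligned / query_length
        if identity >= min_identity and coverage >= min_coverage:
            hits.append((record["sseqid"], identity, round(coverage, 2)))
    return hits


# Invented example rows; the subject IDs are placeholders, not real entries.
example = (
    "my_target\thomolog_A\t82.5\t300\t50\t2\t1\t300\t5\t304\t1e-150\t520\n"
    "my_target\thomolog_B\t35.0\t280\t170\t6\t10\t289\t1\t280\t1e-40\t160\n"
    "my_target\tshort_fragment\t95.0\t40\t2\t0\t100\t139\t1\t40\t1e-15\t80\n"
)
good_hits = filter_blast_hits(example, query_length=310)
print(good_hits)
```

Only `homolog_A` survives here: `homolog_B` is too divergent and `short_fragment` covers too little of the query. The surviving subject IDs are what you would then look up in UniProt and ChEMBL.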

Alt data

If neither ChEMBL nor PubChem is working for you, you might need to get creative.

When you say you did a virtual screen, did you dock a library and potentially cross-validate the scores? If not, I assume you instead split up your molecules by functional group and properties (QSAR kind of thing) and used that data to train a model to predict the bioactivity outcome where it is available. In which case, for a target protein where you don't have bioactivity data, you could do docking to see if the data correlate. To do that you might need to make a structural model of your protein if a 3D structure is not available in the PDB. You might find a threaded model in Expasy SWISS-MODEL, but if not, a model generated with I-TASSER or Phyre (threaded/ab initio) will be no good for docking.

Alternatively, and totally a long shot... if your protein is an enzyme and you just need a few targets, BRENDA is a good resource to find the parameters of the native metabolites and promiscuous activities, which can be used as positive controls for any virtual screen, although metabolites aren't very drug-like (sensu Lipinski's rule of five). That is, whereas UniProt, MetaCyc and other sites just give the physiological substrate, BRENDA will tell you about similar compounds that still bind, but badly.


Thank you very much for your kind and detailed answer.

I think my question was not detailed enough, as your second answer is quite different from what I expected. Let me explain in more detail.

My main goal is to find hit compounds, from ChEMBL or similar databases such as ZINC, that bind to my human target protein. The protein does not have any bioactivity data in ChEMBL or anywhere else.

In order to virtually screen all the ChEMBL and ZINC compounds against my target protein, I first need to train my model with some bioactivity/binding affinity data.

My main question is: what is the best way to get bioactivity data for this target protein? Is a protein similarity search for my target protein a reasonable approach (assuming that some of the similar proteins have bioactivity data in ChEMBL that I can use to train my model)? If so, how can I find proteins similar to my target protein? Your suggestion of the BLAST web site for protein similarity search is helpful, but when I tried the example sequence on the web site, it was not readily apparent how to get what I want. I am not sure how to get the proteins from the results shown in the table. Do you know of any tutorial that describes this process, or can you briefly describe it? Is the "Percent Identity" column the one you meant by "similarity above 70%"?


Sorry for the delay. Yes, in the absence of any data (few proteins have any), a protein similarity search is the best fallback approach. Keep an eye out for sequence divergence (yes, % identity is the column I meant), especially in the active site residues (even with a model made as mentioned above). If you get many close homologues with bioactivity data, differences in the activity of a given compound would indicate differences in the binding pocket (which can be checked by looking at the structural model).
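To make the active-site check concrete, here is a small sketch that takes a pairwise alignment of the target and one homologue (gaps written as `-`) and reports which residues sit at chosen target positions. The function name, the toy sequences, and the site positions are invented for illustration; in practice the alignment would come from BLAST or another aligner and the positions from annotation or a structural model.

```python
def compare_site_residues(aligned_target, aligned_homologue, site_positions):
    """Map 1-based positions in the ungapped target sequence to
    (target_residue, homologue_residue) pairs from a pairwise alignment."""
    if len(aligned_target) != len(aligned_homologue):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(site_positions)
    result = {}
    position = 0  # residue counter along the ungapped target
    for target_res, homologue_res in zip(aligned_target, aligned_homologue):
        if target_res == "-":
            continue  # insertion in the homologue, no target residue here
        position += 1
        if position in wanted:
            result[position] = (target_res, homologue_res)
    return result


# Toy alignment and hypothetical active-site positions (5, 6 and 9)
aligned_target = "MKT-LHDSGA"
aligned_homologue = "MKTALHESG-"
site = compare_site_residues(aligned_target, aligned_homologue, [5, 6, 9])
differences = {pos: pair for pos, pair in site.items() if pair[0] != pair[1]}
print(differences)
```

A homologue whose pocket residues differ from the target at these positions is a weaker source of training data than its overall % identity suggests.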

There are loads of tutorials online on how to use BLAST, and I was slow at replying here, so I assume that the hurdle of using BLAST has been cleared.

It is a reasonable approach to get data for your task, but not having any data to start with is far from ideal, as the results will be compounds with broad activities. If you opt for using the non-conflicting bioactivity data of the homologues, you'd risk uncovering the obvious, say, that beta-chloroalanine inhibits PLP-dependent enzymes or that GTPγS messes with G-proteins... Note that the lack of data would be fine for a (normal) force-field-based virtual screen (i.e. with structures), although the starting point there is most often robotic fragment-based screening or synthons of the wild-type ligand.
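One possible reading of "non-conflicting" is sketched below: pool the records from all homologues and keep only compounds whose active/inactive label agrees everywhere they were measured. The record layout, the function name, and the toy SMILES are assumptions; it also assumes the SMILES were canonicalised beforehand so that the same compound always has the same string.

```python
from collections import defaultdict


def pool_consistent_labels(records):
    """records: iterable of (smiles, homologue_id, is_active).
    Return {smiles: label} for compounds with one agreed label."""
    labels = defaultdict(set)
    for smiles, _homologue_id, is_active in records:
        labels[smiles].add(bool(is_active))
    # Drop compounds reported as both active and inactive
    return {
        smiles: next(iter(seen))
        for smiles, seen in labels.items()
        if len(seen) == 1
    }


# Toy records; homologue IDs and activities are invented.
records = [
    ("CCO", "hom_A", True),
    ("CCO", "hom_B", True),
    ("c1ccccc1", "hom_A", True),
    ("c1ccccc1", "hom_B", False),  # conflicting, will be dropped
    ("CC(=O)O", "hom_B", False),
]
pooled = pool_consistent_labels(records)
print(pooled)
```

The conflicting compounds that get dropped are arguably the more interesting ones, since they point at pocket differences between homologues, so it may be worth keeping them aside rather than discarding them.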

