Question

Programmatically retrieving positions of protein active site residues

16 days ago

Mariana ▴ 40

Hello,

I have a database of hydrolases (in particular, plastic-degrading enzymes) and I would like to retrieve the positions of the residues in the catalytic triad (that is, the active site residues annotated on UniProt) for each enzyme. Is there a way I could retrieve those residues and their respective positions in the sequence programmatically?

I have seen other similar posts but a lot of the tools and websites suggested are down.

Thank you in advance!

Uniprot PDB Proteins • 584 views

ADD COMMENT • link updated 15 days ago by me ▴ 760 • written 16 days ago by Mariana ▴ 40

"I have seen other similar posts but a lot of the tools and websites suggested are down."

Please show & share your research as part of your post. Otherwise, without specifics we don't know if you missed the best one and are forced to redo research you probably already did. Always endeavor to help those you want to help you.
For example, I indeed see that the one associated with Kirshner et al 2013 'Catalytic site identification—a web server to identify catalytic site structural matches throughout PDB' is not responding.

Have you tried GASS-WEB, a web server for identifying enzyme active sites based on genetic algorithms? It is presumably related to Izidoro et al. 'GASS: identifying enzyme active sites with genetic algorithms' and Moraes et al. 'GASS-WEB: a web server for identifying enzyme active sites based on genetic algorithms'.

Also keep in mind that the PDB files of many structures are annotated with catalytic residues and residues involved in biological roles. See the Proteopedia 'Site' page where it gives an example for 1eve:

"For example, at 1eve, below the molecule, are green links to highlight a catalytic site, and an inhibitor binding site. These two sites were abbreviated in the atomic coordinate file as CAT and IHB, respectively"

Interestingly, I was introduced to Python as an excellent tool for structural biology analysis in a minicourse where we had to write a script that would scan a large collection of PDB files for a catalytic triad of specific residues that fell within a specific distance of one another. In fact, I point this out in a related reply here. There may be a couple of additional leads there. I've now updated the next section to explore that idea further.

Following up on that Python script to scan for catalytic triads:

Searched at GitHub with the following terms:
'pdb catalytic triad'
Got this repo in the results:
https://github.com/aretasg/catalytic-triad-detection

Went to that 'catalytic-triad-detection' repo and saw it says "Detect catalytic triads in serine protease PDB files based on geometry" and that it looks nice with a detailed README and a provided Python script and example input data.

Went to here and clicked on 'launch binder' to get a fresh, temporary session in JupyterLab. When the session came up, I used the launcher to make a terminal window. In that terminal window I ran the following commands to clone the catalytic-triad-detection repo and run the script on any PDB files present as suggested by the README:

git clone https://github.com/aretasg/catalytic-triad-detection.git
cd catalytic-triad-detection/
for f in *.pdb;do python find_triad.py $f;done

Saw the result for the only PDB file presently provided:

jovyan@jupyter-binder-2dexamples-2drequirements-2djxqt1qa8:~/catalytic-triad-detection$ for f in *.pdb;do python find_triad.py $f;done
The triad atoms of 1agj_A.pdb chain A are: 1539-586-953 The triad residue numbers are: 195-72-120

Any way to check the script is doing what it is meant to do correctly? By double-clicking on it to open it in the Jupyter session, I noted that the 1agj_A.pdb file used in the example is just the atom coordinates for the structure. (In other words, it is a simplified version of the data that doesn't include the proper header that most RCSB/PDB entries have.) So that particular file doesn't include any additional information to help assess.

Went to the RCSB structure entry for PDB id code 1agj. Clicked on 'Display Files' drop-down in upper right and selected 'PDB Format (Header)' to get this header.

In the header, REMARK 800 describes the SITE records in a PDB file, as spelled out here under 'REMARK 800 (updated)' and on Proteopedia's 'Site' record page. The corresponding SITE details appear further down in the header, in the SITE section:

REMARK 800 SITE                                                                 
REMARK 800 SITE_IDENTIFIER: SNA                                                 
REMARK 800 EVIDENCE_CODE: UNKNOWN                                               
REMARK 800 SITE_DESCRIPTION: CATALYTIC TETRAD OF SERINE PROTEASE.               
REMARK 800                                                                      
REMARK 800 SITE_IDENTIFIER: SNB                                                 
REMARK 800 EVIDENCE_CODE: UNKNOWN                                               
REMARK 800 SITE_DESCRIPTION: CATALYTIC TETRAD OF SERINE PROTEASE.      


....


SITE     1 SNA  4 HIS A  72  ASP A 120  SER A 195  SER A 211                    
SITE     1 SNB  4 HIS B  72  ASP B 120  SER B 195  SER B 211

That seems in good agreement with the script result of:

"The triad residue numbers are: 195-72-120"

Further validation that the script is identifying the correct residues:
UniProt entry P09331 comes up as the sole result if you enter the PDB code 1agj at the UniProt site.
Visually comparing the sequence at the PDB with UniProt entry P09331, or using PDBrenum with 1agj (see here), reveals that the PDB entry 1agj starts at the 39th position of the full-length sequence, so you can get the corresponding numbers for the full-length protein sequence by adding 38 to the PDB residue values. That gives positions 233, 110, and 158 in the full-length version of the protein experimentally solved in 1agj, and those position numbers are what UniProt entry P09331 lists under 'Features: Showing features for active site'. (Plus, see my use of the AI-generated code in my comment replying to GenoMax's suggestion, which gives those numbers from a programmatic query of UniProt.)
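
The cross-check above (read the SITE records, then shift the PDB numbering to UniProt numbering) can be sketched in Python. This is only an illustration: the function names are made up for this example, the column slices follow the fixed-width SITE record layout, and the constant offset of 38 is specific to 1agj versus P09331 as worked out above. A constant offset only holds when the PDB numbering is contiguous; entries with insertions or gaps need a per-residue mapping instead.

```python
# SITE records for 1agj, copied from the header shown above.
HEADER_SNIPPET = """\
SITE     1 SNA  4 HIS A  72  ASP A 120  SER A 195  SER A 211
SITE     1 SNB  4 HIS B  72  ASP B 120  SER B 195  SER B 211
"""


def parse_site_records(pdb_text):
    """Return {site_id: [(residue name, chain, residue number), ...]} from SITE lines."""
    sites = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SITE  "):
            continue  # skips REMARK 800 and every other record type
        line = line.ljust(80)  # pad so fixed-width slicing never runs short
        site_id = line[11:14].strip()
        # Each SITE line holds up to four residues in fixed 11-column slots.
        for start in (18, 29, 40, 51):
            res_name = line[start:start + 3].strip()
            if not res_name:
                continue  # empty slot
            chain = line[start + 4]
            res_num = int(line[start + 5:start + 9])
            sites.setdefault(site_id, []).append((res_name, chain, res_num))
    return sites


def pdb_to_uniprot(res_num, offset=38):
    """Shift a PDB residue number by a constant offset (38 for 1agj vs P09331)."""
    return res_num + offset


sites = parse_site_records(HEADER_SNIPPET)
for res_name, chain, res_num in sites["SNA"]:
    print(res_name, chain, res_num, "->", pdb_to_uniprot(res_num))
```

For the SNA site this maps 72, 120 and 195 to 110, 158 and 233, matching the UniProt active site positions, and also reports the fourth residue of the annotated tetrad (SER 211) shifted by the same offset.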

Related GitHub options:
If you relax the search to 'pdb catalytic', also see:

Example jupyter notebook to calculate Rosetta Constraints between catalytic residues given a pdb structure

Or with 'pdb active site':

see Enzyme-Active-Sites-Extractor

The point is there are quite a few adaptable approaches out there.
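
For anyone who wants to adapt the distance-based idea mentioned earlier in this reply rather than use an existing repo, here is a minimal sketch. It is not the code from the catalytic-triad-detection repo. The choice of side-chain atoms (Ser OG, His NE2/ND1, Asp OD1/OD2), the 4.0 angstrom cutoff, and all function names are assumptions made for illustration, and it only looks for Ser-His-Asp within a single chain.

```python
import math
from itertools import product

# Side-chain atoms assumed to matter for a Ser-His-Asp triad (illustrative choice).
TRIAD_ATOMS = {"SER": ("OG",), "HIS": ("NE2", "ND1"), "ASP": ("OD1", "OD2")}


def read_triad_atoms(pdb_text):
    """Map residue name -> {(chain, residue number): [xyz, ...]} for candidate atoms."""
    residues = {name: {} for name in TRIAD_ATOMS}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        res_name, atom_name = line[17:20], line[12:16].strip()
        if atom_name in TRIAD_ATOMS.get(res_name, ()):
            key = (line[21], int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            residues[res_name].setdefault(key, []).append(xyz)
    return residues


def min_dist(coords_a, coords_b):
    """Smallest distance between any atom in one set and any atom in the other."""
    return min(math.dist(a, b) for a in coords_a for b in coords_b)


def find_triads(pdb_text, cutoff=4.0):
    """Return sorted (chain, Ser, His, Asp) tuples where Ser-His and His-Asp are within cutoff."""
    residues = read_triad_atoms(pdb_text)
    hits = []
    for (s_key, s_xyz), (h_key, h_xyz), (d_key, d_xyz) in product(
            residues["SER"].items(), residues["HIS"].items(), residues["ASP"].items()):
        if not (s_key[0] == h_key[0] == d_key[0]):
            continue  # require all three residues on the same chain
        if min_dist(s_xyz, h_xyz) <= cutoff and min_dist(h_xyz, d_xyz) <= cutoff:
            hits.append((s_key[0], s_key[1], h_key[1], d_key[1]))
    return sorted(hits)
```

Calling find_triads(open("1agj_A.pdb").read()) on a coordinate file returns candidate triads that should then be checked against SITE records or UniProt annotations, as done above, since a purely geometric filter can report residues that are close together without being catalytic.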

ADD REPLY • link 16 days ago by Wayne ★ 2.0k

While you wait to get answers, consider asking ChatGPT for a solution. You may be pleasantly surprised with pointers/code you will get.

ADD REPLY • link 16 days ago by GenoMax 142k

Wow, I was shocked at this AI answer, which looks to be a good start for using UniProt. Sharing it in case it is useful. (It will work in sessions launched from here with no need to install requests because requests is already present.) NOT WRITTEN BY ME; it's AI generated. To get the following to work meaningfully, I replaced protein_id = 'P0AA25' with protein_id = 'P09331' and deleted if "CATALYTIC ACTIVITY" in line from the line catalytic_lines = [line for line in uniprot_text.split('\n') if "CATALYTIC ACTIVITY" in line]. That then gives: Catalytic residues: [('ACT_SITE', '110'), ('ACT_SITE', '158'), ('ACT_SITE', '233')]. The code only gives residue positions for now, but the sequence is in the returned uniprot_data and should be parseable with the position information. You'll see the prompt is just a slight reworking of the original post:

PROMPT:
Aim: Programmatically retrieving positions of protein active site residues. Details: I have a database of hydrolases (in particular, plastic degrading enzymes) and i would like to rerieve the positions of the residues in the catalytic triad (that are the active site residues annotated on UnipPot) for each enzyme. Is there a way I could retrieve those residues and respective positions in the sequence, programatically?

ANSWER:

Absolutely, you can programmatically retrieve positions of protein active site residues, specifically the catalytic triad residues from hydrolases, using the UniProt database. The process involves querying the UniProt database, extracting the required data, and parsing it to get the positions and identities of the residues in the catalytic triad. Below, I will guide you through the steps using Python and the requests library to interact with the UniProt API.

Step 1: Setup Your Python Environment

First, you'll need Python installed on your machine along with the requests library, which we'll use to make HTTP requests. You can install the requests library using pip if you haven't already:

pip install requests

Step 2: Define Your Query

To retrieve data from UniProt, you can use their REST API. You will need to form a query that specifies you're interested in enzymes that are hydrolases and have annotations about the catalytic triad. Here's how you can construct such a query:

import requests

def fetch_uniprot_data(protein_id):
    url = f"https://www.uniprot.org/uniprot/{protein_id}.txt"
    response = requests.get(url)
    return response.text

def extract_catalytic_residues(uniprot_text):
    """
    Extracts catalytic residue data from UniProt text data.
    """
    catalytic_lines = [line for line in uniprot_text.split('\n') if "CATALYTIC ACTIVITY" in line]
    residues = []
    for line in catalytic_lines:
        # This is a simplified assumption and parsing logic should be adapted based on actual data examination
        parts = line.split(';')
        for part in parts:
            if 'ACT_SITE' in part:
                residue_info = part.split()
                residues.append((residue_info[1], residue_info[2].replace('.', '')))  # (position, residue)
    return residues

# Example usage:
protein_id = 'P0AA25'  # Example protein ID for a hydrolase
uniprot_data = fetch_uniprot_data(protein_id)
catalytic_residues = extract_catalytic_residues(uniprot_data)
print("Catalytic residues:", catalytic_residues)

Step 3: Parse the Data

In the extract_catalytic_residues function, you'll need to parse the text data returned by UniProt. Look for lines that mention "CATALYTIC ACTIVITY" or similar keywords that might indicate the presence of catalytic triad information. This will vary depending on the specific annotations UniProt uses for these enzymes.

Step 4: Automation for Multiple Enzymes

To scale this up and retrieve data for multiple enzymes, you can iterate over a list of UniProt IDs:

protein_ids = ['P0AA25', 'Q5SMY1', 'P0AA26']  # List of UniProt IDs for different hydrolases

all_catalytic_residues = {}
for pid in protein_ids:
    uniprot_data = fetch_uniprot_data(pid)
    catalytic_residues = extract_catalytic_residues(uniprot_data)
    all_catalytic_residues[pid] = catalytic_residues

print("Catalytic residues in multiple proteins:", all_catalytic_residues)

Step 5: Refining and Validation

Make sure to validate the accuracy of your parsed data:

Check a few entries manually by comparing with the UniProt website.
Adjust the parsing logic in extract_catalytic_residues as necessary, depending on the structure and consistency of the data in the UniProt entries.

Conclusion

This basic script provides a framework to start programmatically retrieving information on catalytic residues of hydrolase enzymes from UniProt. Depending on the completeness and consistency of UniProt annotations, you may need to adjust your parsing logic to ensure accuracy and comprehensiveness.
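
As a follow-up to the parsing caveats above, here is a hedged sketch of a parser that targets the ACT_SITE feature lines directly and also looks up the residue letter from the sequence block of the same flat-file text. The function name and the toy entry are made up for illustration (the toy entry is not real UniProt data), and the parser assumes the flat-file layout in which the position is the third whitespace-separated field of an "FT   ACT_SITE" line and the sequence sits between the SQ line and the closing //.

```python
def parse_active_sites(uniprot_text):
    """Return [(position, residue letter), ...] for ACT_SITE features in a flat-file entry."""
    positions = []
    seq_chunks = []
    in_sequence = False
    for line in uniprot_text.splitlines():
        if line.startswith("FT   ACT_SITE"):
            fields = line.split()
            # fields -> ['FT', 'ACT_SITE', '<position>', ...]; skip uncertain positions like '?'
            if len(fields) >= 3 and fields[2].isdigit():
                positions.append(int(fields[2]))
        elif line.startswith("SQ "):
            in_sequence = True  # sequence lines follow until '//'
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            seq_chunks.append(line.replace(" ", ""))
    sequence = "".join(seq_chunks)
    # UniProt positions are 1-based, Python strings are 0-based.
    return [(pos, sequence[pos - 1]) for pos in positions if pos <= len(sequence)]


# Made-up miniature entry, only to show the expected layout.
TOY_ENTRY = """ID   TOY_ENTRY   Unreviewed;   12 AA.
FT   ACT_SITE        3
FT                   /note="Charge relay system"
FT   ACT_SITE        7
SQ   SEQUENCE   12 AA;  0 MW;  0 CRC64;
     MKHAAADLLS GG
//
"""

print(parse_active_sites(TOY_ENTRY))  # [(3, 'H'), (7, 'D')]
```

With a real entry, the text returned by fetch_uniprot_data('P09331') can be passed in place of TOY_ENTRY; the positions should then be compared with the active site features shown on the UniProt website, as Step 5 suggests.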

ADD REPLY • link 16 days ago by Wayne ★ 2.0k

score 0 · Answer 1 · 2024-04-29

Although this probably doesn't qualify as programmatically retrieving positions, you can always do it by sequence alignments. This can be done in two ways: 1) build a multiple sequence alignment yourself and inspect the alignment for catalytic triad positions; 2) use a hidden Markov model (or several of them) of your enzyme to align the sequences and then look for conserved residues. The latter option is much better, but it assumes that you have an HMM.
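
Once an alignment exists, the inspection step can be partly automated. The sketch below assumes you already have two rows from the same alignment (a reference enzyme with known catalytic positions and a query enzyme) and transfers the reference positions to the query through the shared columns. The function name and the toy sequences are made up for illustration; how the alignment is produced (MSA tool or HMM) is outside this sketch.

```python
def map_positions(aligned_ref, aligned_query, ref_positions):
    """Map 1-based reference positions to (1-based query position, residue) via alignment columns.

    Both inputs are rows of the same alignment: equal length, '-' for gaps.
    A reference residue aligned to a gap in the query maps to None.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(ref_positions)
    mapping = {}
    ref_pos = query_pos = 0
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char != "-":
            ref_pos += 1
        if query_char != "-":
            query_pos += 1
        if ref_char != "-" and ref_pos in wanted:
            mapping[ref_pos] = (query_pos, query_char) if query_char != "-" else None
    return mapping


# Toy alignment rows (not real enzymes).
reference = "MK-HDS"
query = "MKAH-S"
print(map_positions(reference, query, [3, 4, 5]))
```

A mapped residue that differs from the reference residue (for example, a reference Ser aligned to a query Ala) is a hint that the alignment or the annotation transfer needs a closer look.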

score 0 · Answer 2 · 2024-04-30

This can be done via SPARQL at the UniProt.org SPARQL endpoint:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?enzyme ?activeSitePosition
WHERE {
  # Enzymes that are hydrolases
  ?enzyme rdfs:subClassOf <http://purl.uniprot.org/enzyme/3.-.-.-> .
  # UniProt entries with an EC classification
  ?protein up:enzyme|up:annotation/up:catalyticActivity/up:enzymeClass ?enzyme ;
           up:annotation ?activeSiteAnnotation .
  ?activeSiteAnnotation a up:Active_Site_Annotation ;
                        up:range ?activeSiteAnnotationRange .
  ?activeSiteAnnotationRange faldo:begin/faldo:position ?activeSitePosition .
}

However, this query does not select for active site triads, nor only for plastic-degrading enzymes, and the active sites might not be associated with the hydrolase activity, as for example in HIS4_ARATH.

The plastic-degrading enzymes could be found with the structured catalytic activity annotations if you could define 'plastics' in terms of Rhea and ChEBI. Triads can be selected by adding GROUP BY ?protein ?enzyme and HAVING(COUNT(?activeSitePosition) = 3) at the end of the query. These triads may include false positives/negatives due to multifunctional enzymes.
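
To run the query from a script, something like the following sketch could work. It uses only the Python standard library. Several parts are assumptions rather than part of the answer above: the endpoint URL https://sparql.uniprot.org/sparql, the GROUP_CONCAT projection (needed because a grouped query can only return grouped variables or aggregates), the DISTINCT keywords (to avoid double counting when both branches of the property path match), the LIMIT, and the function names.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed endpoint URL

TRIAD_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?enzyme
       (GROUP_CONCAT(DISTINCT STR(?activeSitePosition); separator=",") AS ?positions)
WHERE {
  ?enzyme rdfs:subClassOf <http://purl.uniprot.org/enzyme/3.-.-.-> .
  ?protein up:enzyme|up:annotation/up:catalyticActivity/up:enzymeClass ?enzyme ;
           up:annotation ?activeSiteAnnotation .
  ?activeSiteAnnotation a up:Active_Site_Annotation ;
                        up:range ?activeSiteAnnotationRange .
  ?activeSiteAnnotationRange faldo:begin/faldo:position ?activeSitePosition .
}
GROUP BY ?protein ?enzyme
HAVING (COUNT(DISTINCT ?activeSitePosition) = 3)
LIMIT 10
"""


def run_sparql(query, endpoint=SPARQL_ENDPOINT):
    """Send a SPARQL query over HTTP GET and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def rows_from_results(results):
    """Flatten SPARQL JSON results into (protein URI, enzyme URI, sorted positions) tuples."""
    rows = []
    for binding in results["results"]["bindings"]:
        positions = sorted(int(p) for p in binding["positions"]["value"].split(","))
        rows.append((binding["protein"]["value"], binding["enzyme"]["value"], positions))
    return rows


# Example (needs network access; the query may be slow on the full database):
# for protein, enzyme, positions in rows_from_results(run_sparql(TRIAD_QUERY)):
#     print(protein, enzyme, positions)
```

Restricting ?protein to your own list of accessions (for example with a VALUES clause) would likely be faster than scanning all hydrolases, and the caveats above about multifunctional enzymes still apply to whatever comes back.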