Question: Merging Data From Pfam, PDB And UniProt
6
7.1 years ago by
Aurobhima100
University of Birmingham
Aurobhima100 wrote:

Hi,

Does anyone know of a method to map between Pfam, PDB and UniProt? I want to select data using very specific criteria, and this requires a combination of these three databases.

I have been working on a solution for some time now on my own, but would like to know if anyone else has been doing something like this and if they'd be interested in discussing this with me.

Thanks

pdb uniprot • 4.8k views
ADD COMMENTlink modified 2.3 years ago by konrad.koehler0 • written 7.1 years ago by Aurobhima100
8
7.1 years ago by
Manhattan, NY
Khader Shameer17k wrote:

Residue-level cross reference data based on PDB is available via SIFTS annotations.

Please check the following files at SIFTS Quick Access:

pdb_chain_uniprot.lst - A summary of the PDBe to UniProt residue level mapping, showing the start and end residues of the mapping using SEQRES, PDB sequence and UniProt numbering.

pdb_chain_pfam.lst - A summary of the Pfam domain identifier(s)(derived via the UniProt mapping) for each PDB chain that has been processed.

You can combine the two files, using an identifier they share (for example the PDB ID and chain) to map to the others. This is the best cross-reference for PDB-UniProt-Pfam I could find. I am using this in my analysis.
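A minimal sketch of that join, assuming both summary files have been converted to tab-separated text with a header row. The column names (`PDB`, `CHAIN`, `SP_PRIMARY`, `PFAM_ID`) and all identifiers in the sample rows are placeholders for illustration; check them against the files you actually download.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows in an ASSUMED tab-separated layout.
CHAIN_UNIPROT = (
    "PDB\tCHAIN\tSP_PRIMARY\n"
    "1abc\tA\tX00001\n"
    "1abc\tB\tX00002\n"
    "2xyz\tA\tX00001\n"
)
CHAIN_PFAM = (
    "PDB\tCHAIN\tPFAM_ID\n"
    "1abc\tA\tPF00000\n"
    "1abc\tA\tPF11111\n"
    "2xyz\tA\tPF00000\n"
)


def read_rows(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_chain(uniprot_rows, pfam_rows):
    """Inner join of the two tables on the (PDB ID, chain) pair."""
    pfam_by_chain = defaultdict(list)
    for row in pfam_rows:
        pfam_by_chain[(row["PDB"], row["CHAIN"])].append(row["PFAM_ID"])

    joined = []
    for row in uniprot_rows:
        key = (row["PDB"], row["CHAIN"])
        # Chains without any Pfam annotation are dropped by this inner join.
        for pfam_id in pfam_by_chain.get(key, []):
            joined.append((row["PDB"], row["CHAIN"], row["SP_PRIMARY"], pfam_id))
    return joined


triples = join_on_chain(read_rows(CHAIN_UNIPROT), read_rows(CHAIN_PFAM))
for triple in triples:
    print("\t".join(triple))
```

Joining on the chain rather than on the PDB ID alone matters because different chains of one structure can map to different UniProt entries.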

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Khader Shameer17k
2

What kind of issues ?

ADD REPLYlink written 7.1 years ago by Khader Shameer17k

Thanks.. we did try it before and found that there are some issues with it.. which is why we went our own way.. but it is the closest I've seen to what I'm looking for..

ADD REPLYlink written 7.1 years ago by Aurobhima100
4
7.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

Another answer for fun, using bio2rdf :-)

from http://uniprot.bio2rdf.org/sparql use the following query

select ?id ?pdb ?pfam  where {
?s <http://purl.org/dc/elements/1.1/identifier> ?id .
?s a <http://bio2rdf.org/core:Protein> .
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pdb .  
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pfam . 
FILTER regex(?pdb, "pdb:") 
FILTER regex(?pfam, "pfam:")

} limit 100 ##remove this for a larger answer

id  pdb     pfam
uniprot:P13744  http://bio2rdf.org/pdb:2E9Q     http://bio2rdf.org/pfam:PF00190
uniprot:P13744  http://bio2rdf.org/pdb:2EVX     http://bio2rdf.org/pfam:PF00190
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON3     http://bio2rdf.org/pfam:PF01039
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON9     http://bio2rdf.org/pfam:PF01039
uniprot:Q10666  http://bio2rdf.org/pdb:3C2G     http://bio2rdf.org/pfam:PF00505
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF08100
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF00891
uniprot:P31946  http://bio2rdf.org/pdb:2BQ0     http://bio2rdf.org/pfam:PF00244
uniprot:P31946  http://bio2rdf.org/pdb:2C23     http://bio2rdf.org/pfam:PF00244
uniprot:Q12802  http://bio2rdf.org/pdb:2DRN     http://bio2rdf.org/pfam:PF00169
(...)
ADD COMMENTlink written 7.1 years ago by Pierre Lindenbaum111k

The bio2rdf UniProt data is unfortunately very much out of date :(

ADD REPLYlink written 7.1 years ago by Jerven640
2
7.1 years ago by
Michael Kuhn4.9k
Dresden, Germany
Michael Kuhn4.9k wrote:

It's all there on UniProt in "Cross-references", e.g. see this entry for NMB1681. The data is also available in the export formats, e.g. text format.

ADD COMMENTlink written 7.1 years ago by Michael Kuhn4.9k
4

Example of an inconsistency?

ADD REPLYlink written 7.1 years ago by Neilfws48k

I have these data, but there are inconsistencies in the cross references between the 3 databases.. I wish it were that straightforward..

ADD REPLYlink written 7.1 years ago by Aurobhima100
1
7.1 years ago by
Chris Evelo9.9k
Maastricht, The Netherlands
Chris Evelo9.9k wrote:

You might want to have a look at our BridgeDB, which was developed to help you solve questions like this. See: http://www.bridgedb.org

ADD COMMENTlink written 7.1 years ago by Chris Evelo9.9k
1

thanks I'll have a look into it.. it could be useful..

ADD REPLYlink written 7.1 years ago by Aurobhima100
0
7.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

The file ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz seems to contain all the IDs.

curl  -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz" |\
gunzip -c |\
egrep -i '(accession|pfam|pdb)'

  (...)
  <accession>P0C9E9</accession>
  <accession>P0C9K3</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  <accession>P0C9I4</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  (...)
  <property type="PDB accession" value="1KMH"/>
  (...)
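The grep output above loses the link between an accession and its cross-references. A hedged sketch of keeping them together by parsing the XML entry by entry follows; the sample document is a heavily simplified, made-up stand-in (placeholder IDs, assumed element layout), so verify the element and attribute names against the real file before relying on it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the IDs are placeholders, not real entries.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <accession>X00009</accession>
    <dbReference type="Pfam" id="PF00000"/>
    <dbReference type="PDB" id="0XXX"/>
  </entry>
  <entry>
    <accession>X00002</accession>
    <dbReference type="Pfam" id="PF11111"/>
  </entry>
</uniprot>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def cross_refs(stream):
    """Yield (primary accession, Pfam IDs, PDB IDs) for each <entry>."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        accessions = [c.text for c in elem if local(c.tag) == "accession"]
        refs = {"Pfam": [], "PDB": []}
        for child in elem.iter():
            if local(child.tag) == "dbReference" and child.get("type") in refs:
                refs[child.get("type")].append(child.get("id"))
        yield accessions[0], refs["Pfam"], refs["PDB"]
        elem.clear()  # free memory; the real file is very large


records = list(cross_refs(io.BytesIO(SAMPLE_XML.encode("utf-8"))))
for accession, pfam_ids, pdb_ids in records:
    print(accession, ",".join(pfam_ids), ",".join(pdb_ids), sep="\t")
```

For the real download, the same function could be fed a `gzip.open(path, "rb")` stream instead of the in-memory sample.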
ADD COMMENTlink written 7.1 years ago by Pierre Lindenbaum111k

There is also similar data in Pfam, and there is the UniProt ID in the header of PDB files.. but they don't play nicely with each other.

ADD REPLYlink written 7.1 years ago by Aurobhima100
0
7.1 years ago by
Nabellaleen10
Paris, France
Nabellaleen10 wrote:

There are also modern solutions using data-integration software (e.g. http://www.isoft.fr/bio/biopack_data_en.htm ). It definitely fills me with despair to see people "reinvent the wheel" for the main, but only the first, step of their work: data access and mining ...

ADD COMMENTlink written 7.1 years ago by Nabellaleen10

Thanks.. I'll have a look.. not sure I'm re-inventing the wheel though.. I have yet to find something that comes close to what I'm trying to do.. I need to apply very specific selection criteria, e.g. all Pfam domains which are only present in non-membrane mitochondrial proteins. Or which protein structures are found exclusively in the extracellular space in Eukaryotes.. if I'm reinventing the wheel, I'd be really happy to use the existing one.. :-)

ADD REPLYlink written 7.1 years ago by Aurobhima100

It sounds like you should be able to build a query to answer that using SRS. Or at most a couple of queries!

ADD REPLYlink written 7.1 years ago by Iain260

In fact, there is software that lets you easily import, read, parse, filter and cross-reference data with total control over all parameters. With this type of software you can build a pipeline for your needs, or for many other needs, in a few days. And when I say "reinvent the wheel" I don't mean your specific analysis, but the re-designing of scripts each time with only minor changes and at a large time cost :)

ADD REPLYlink written 7.1 years ago by Nabellaleen10
0
7.1 years ago by
Jerven640
Jerven640 wrote:

On uniprot.org, use "Customize display" in the UniProt entry view.

Or use the mapping service: http://www.uniprot.org/uniprot/?tab=mapping

If you want to discuss the way UniProt maps to PDBe (not as straightforward as you might think), contact help@uniprot.org. Pfam comes directly out of the InterPro results and there should not be that much skew between these databases.

ADD COMMENTlink written 7.1 years ago by Jerven640
0
7.1 years ago by
Iain260
Iain260 wrote:

You could try using the SRS service at the EBI.

http://srs.ebi.ac.uk/

This service links many databases with each other.

There is a tutorial available: http://www.embl.de/~seqanal/courses/srscourse/srstut.html

An example taken directly from this tutorial: the query "enzyme < pdb" gives all the enzyme database entries for which the 3D structure is known!

ADD COMMENTlink written 7.1 years ago by Iain260
0
7.1 years ago by
United States
Aleksandr Levchuk3.1k wrote:

I am planning to use a hash/checksum of the protein sequences to cross-link UniProt to the other databases.

SEquence Globally Unique IDentifier (SEGUID) is a hashing standard (based on SHA1) - it was specifically developed for uniquely identifying protein sequences.

See also: our PostgreSQL sequence-to-seguid implementation http://dba.stackexchange.com/questions/66/biological-sequences-of-uniprot-in-postgresql
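For illustration, a minimal sketch of computing such a checksum in Python. It assumes the usual SEGUID definition as I understand it (SHA-1 of the upper-cased sequence, Base64-encoded, with the trailing "=" padding removed); the function name and the example sequences are made up.

```python
import base64
import hashlib


def seguid(sequence):
    """SEGUID-style checksum: Base64 of the SHA-1 digest, padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Identical sequences give identical keys regardless of letter case,
# so the checksum can serve as a join key between databases.
key_a = seguid("MKTAYIAKQR")
key_b = seguid("mktayiakqr")
key_c = seguid("MKTAYIAKQK")
print(key_a == key_b, key_a == key_c)
```

Note that this links only entries whose sequences are exactly identical; a PDB chain that covers a fragment or carries a mutation or tag will not match its UniProt entry this way.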

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Aleksandr Levchuk3.1k

The sequence cross-referencing tool at the EBI might save you some time: http://www.ebi.ac.uk/Tools/picr/

ADD REPLYlink written 7.1 years ago by Iain260

Get in touch and let's see if we can merge my approach with yours. I think your idea has real potential.

ADD REPLYlink written 7.1 years ago by Aurobhima100
0
2.3 years ago by
konrad.koehler0 wrote:

This is a very old thread; however, I still have not found a good way to do PDB to UniProt residue mapping that doesn't rely on a web server that may not be up to date. SIFTS may be the way to go, but it has a complicated data structure. Below is a simple self-contained Biopython function which relies on an on-the-fly sequence alignment to determine the residue mapping. There may be more elegant ways to script this, but the following works.

from Bio.PDB import PPBuilder
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.PDBList import PDBList
from Bio import pairwise2
from Bio import SeqIO

def resmap(chain, uniprot_sequence):
    """Return a dictionary mapping PDB residue numbers to UniProt residue numbers."""

    # Concatenate the polypeptide fragments into the observed PDB sequence.
    ppb = PPBuilder()
    polypeptides = ppb.build_peptides(chain)
    pdb_sequence = ""
    for polypeptide in polypeptides:
        pdb_sequence = pdb_sequence + str(polypeptide.get_sequence())

    # Residue numbers of the non-hetero residues, in ascending order.
    # This assumes each of these residues also appears in pdb_sequence.
    pdb_res_nums = sorted(res.id[1] for res in chain if res.id[0] == " ")

    # Global alignment: match 2, mismatch -1, gap open -0.5, gap extend -0.1.
    alignments = pairwise2.align.globalms(str(uniprot_sequence), pdb_sequence, 2, -1, -.5, -.1)
    uniprot_align = str(alignments[0][0])
    pdb_align     = str(alignments[0][1])

    # UniProt residue number (1-based) per alignment column, -1 for gaps.
    uniprot_map = []
    count = 0
    for residue in uniprot_align:
        if residue != "-":
            count += 1
            uniprot_map.append(count)
        else:
            uniprot_map.append(-1)

    # PDB residue number per alignment column, -1 for gaps.
    pdb_map = []
    count = -1
    for residue in pdb_align:
        if residue != "-":
            count += 1
            pdb_map.append(pdb_res_nums[count])
        else:
            pdb_map.append(-1)

    # Keep only the columns where both sequences have the same residue.
    mapping = {}
    for index, residue in enumerate(uniprot_align):
        if residue != "-" and residue == pdb_align[index]:
            mapping[pdb_map[index]] = uniprot_map[index]

    return mapping
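To make the bookkeeping concrete, here is a small dependency-free sketch of the last step only: turning two already-aligned strings into a residue-number dictionary. The function name, the aligned strings and the PDB residue numbers are all made up for illustration, and no Biopython call is involved.

```python
def map_from_alignment(uniprot_align, pdb_align, pdb_res_nums):
    """Map PDB residue numbers to UniProt positions for identical columns."""
    mapping = {}
    uniprot_pos = 0   # 1-based position in the ungapped UniProt sequence
    pdb_index = -1    # 0-based index into pdb_res_nums
    for u, p in zip(uniprot_align, pdb_align):
        if u != "-":
            uniprot_pos += 1
        if p != "-":
            pdb_index += 1
        # Record the column only when both sides carry the same residue.
        if u == p and u != "-":
            mapping[pdb_res_nums[pdb_index]] = uniprot_pos
    return mapping


# Hypothetical example: the structure lacks the first two residues,
# starts numbering at 10, and has one substitution (I -> L).
example = map_from_alignment("MKTAYIAK", "--TAYLAK", [10, 11, 12, 13, 14, 15])
print(example)
```

Mismatched columns (PDB residue 13 in this example) are left out of the dictionary, which is also how the function above treats them.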
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by konrad.koehler0