Merging Data From Pfam, PDB and UniProt
10
6
10.3 years ago
Aurobhima ▴ 100

Hi,

Does anyone know of a method to map between Pfam, PDB and UniProt? I have very specific criteria I want to select data with, and this requires a combination of these three databases.

I have been working on a solution for some time now on my own, but would like to know if anyone else has been doing something like this and if they'd be interested in discussing this with me.

Thanks

pdb uniprot • 6.6k views
8
10.3 years ago

Residue-level cross reference data based on PDB is available via SIFTS annotations.

Please check the following files at SIFTS Quick Access:

pdb_chain_uniprot.lst - A summary of the PDBe to UniProt residue level mapping, showing the start and end residues of the mapping using SEQRES, PDB sequence and UniProt numbering.

pdb_chain_pfam.lst - A summary of the Pfam domain identifier(s)(derived via the UniProt mapping) for each PDB chain that has been processed.

You can combine the two files, using the identifier they share (the PDB ID and chain) to map to the others. This is the best cross-reference for PDB-UniProt-Pfam I could find. I am using this in my analysis.
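As a rough illustration of combining the two files, here is a minimal Python sketch. It assumes both files have been saved as tab-separated text with a header row, and that the columns are named PDB, CHAIN, SP_PRIMARY and PFAM_ID; those names and the function names are assumptions for this example, so check them against the files you actually download.

```python
import csv
from collections import defaultdict


def read_table(lines, delimiter="\t"):
    """Parse a delimited table, skipping blank lines and '#' comment lines."""
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    return list(csv.DictReader(rows, delimiter=delimiter))


def join_chain_tables(uniprot_rows, pfam_rows):
    """Join the two tables on (PDB ID, chain).

    Returns sorted, de-duplicated (pdb, chain, uniprot, pfam) tuples.
    Column names are assumed, not taken from the SIFTS documentation.
    """
    pfam_by_chain = defaultdict(set)
    for row in pfam_rows:
        pfam_by_chain[(row["PDB"].lower(), row["CHAIN"])].add(row["PFAM_ID"])

    merged = set()
    for row in uniprot_rows:
        key = (row["PDB"].lower(), row["CHAIN"])
        for pfam_id in pfam_by_chain.get(key, ()):
            merged.add((key[0], key[1], row["SP_PRIMARY"], pfam_id))
    return sorted(merged)
```

Chains that appear in only one of the two files are dropped by this join, which may be one place where apparent inconsistencies show up.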

2

What kind of issues?

0

Thanks.. we did try it before and found that there are some issues with it.. which is why we went our own way.. but it is the closest I've seen to what I'm looking for..

4
10.3 years ago

Another answer for fun, using bio2rdf :-)

from http://uniprot.bio2rdf.org/sparql use the following query

select ?id ?pdb ?pfam  where {
?s <http://purl.org/dc/elements/1.1/identifier> ?id .
?s a <http://bio2rdf.org/core:Protein> .
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pdb .
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pfam .
FILTER regex(?pdb, "pdb:")
FILTER regex(?pfam, "pfam:")

} limit 100 ##remove this for a larger answer

id  pdb     pfam
uniprot:P13744  http://bio2rdf.org/pdb:2E9Q     http://bio2rdf.org/pfam:PF00190
uniprot:P13744  http://bio2rdf.org/pdb:2EVX     http://bio2rdf.org/pfam:PF00190
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON3     http://bio2rdf.org/pfam:PF01039
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON9     http://bio2rdf.org/pfam:PF01039
uniprot:Q10666  http://bio2rdf.org/pdb:3C2G     http://bio2rdf.org/pfam:PF00505
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF08100
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF00891
uniprot:P31946  http://bio2rdf.org/pdb:2BQ0     http://bio2rdf.org/pfam:PF00244
uniprot:P31946  http://bio2rdf.org/pdb:2C23     http://bio2rdf.org/pfam:PF00244
uniprot:Q12802  http://bio2rdf.org/pdb:2DRN     http://bio2rdf.org/pfam:PF00169
(...)
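To use such a query from a script, something like the following Python sketch might work. It assumes the endpoint accepts the query as a `query` URL parameter and returns the standard SPARQL JSON results format when asked via a `format` parameter (a convention of some endpoints, not guaranteed here); the function names are made up for this example.

```python
import json
from urllib.parse import urlencode


def build_query_url(endpoint, query):
    """Build a GET URL for a SPARQL endpoint (the 'format' parameter is an assumption)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def local_id(value):
    """Keep only the part after the last colon, e.g. '.../pdb:2E9Q' gives '2E9Q'."""
    return value.rsplit(":", 1)[-1]


def bindings_to_rows(result_json, variables=("id", "pdb", "pfam")):
    """Turn a SPARQL JSON result document into (id, pdb, pfam) tuples."""
    data = json.loads(result_json)
    return [
        tuple(local_id(binding[name]["value"]) for name in variables)
        for binding in data["results"]["bindings"]
    ]
```

The URL from `build_query_url` can be fetched with `urllib.request.urlopen`, and the response body passed to `bindings_to_rows`.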

1

bio2rdf UniProt data is unfortunately very much out of date :(

2
10.3 years ago

It's all there on UniProt in "Cross-references", e.g. see this entry for NMB1681. The data is also available in the export formats, e.g. text format.

4

Example of an inconsistency?

0

I have these data, but there are inconsistencies in the cross references between the 3 databases.. I wish it were that straightforward..

1
10.3 years ago

You might want to have a look at our BridgeDB, which was developed to help you solve questions like this. See: http://www.bridgedb.org

1

thanks I'll have a look into it.. it could be useful..

0
10.3 years ago

The file ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz seems to contain all the IDs.

curl  -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz" |\
gunzip -c |\
egrep -i '(accession|pfam|pdb)'

(...)
<accession>P0C9E9</accession>
<accession>P0C9K3</accession>
<dbReference id="PF01639" key="10" type="Pfam">
<accession>P0C9I4</accession>
<dbReference id="PF01639" key="10" type="Pfam">
(...)
<property type="PDB accession" value="1KMH"/>
(...)
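The grep output loses track of which IDs belong to which entry. A minimal Python sketch that keeps them grouped per entry is given below. It assumes each `<entry>` element has `<accession>` and `<dbReference type="..." id="..."/>` elements as direct children, as the output above suggests, and it ignores the XML namespace; the function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_cross_references(xml_file, wanted=("PDB", "Pfam")):
    """Yield (primary accession, {database: [ids]}) for each entry.

    xml_file is a file name or a binary file object, e.g. the result of
    gzip.open("uniprot_sprot.xml.gz", "rb").
    """
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "entry":
            continue
        accessions = [c.text for c in elem if local_name(c.tag) == "accession"]
        refs = {database: [] for database in wanted}
        for child in elem:
            if local_name(child.tag) == "dbReference" and child.get("type") in refs:
                refs[child.get("type")].append(child.get("id"))
        if accessions:
            yield accessions[0], refs
        elem.clear()  # release the children of entries already processed
```

Parsing entry by entry avoids loading the whole file into memory, which matters for a file of this size.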

0

There is also similar data in Pfam, and the UniProt ID appears in the header of PDB files, but they don't play nicely with each other.

0
10.3 years ago
Nabellaleen ▴ 10

There are also modern solutions based on data-crossing software (e.g. http://www.isoft.fr/bio/biopack_data_en.htm). It definitely fills me with despair to see people "reinvent the wheel" for what is an essential but only the first step of their work: data access and mining...

0

Thanks.. I'll have a look.. not sure I'm re-inventing the wheel though.. I have yet to find something that comes close to what it is I'm trying to do.. I need to make very specific selection criteria, e.g. all Pfam domains which are only present in non-membrane mitochondria proteins. Or which protein structures can be found exclusively extra-cellular in Eukaryotes.. if I'm reinventing the wheel, I'd be really happy to use the existing one.. :-)
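Once the cross-references are merged, a criterion like "domains found only in a given class of proteins" reduces to set arithmetic. A minimal Python sketch follows; it assumes you already have a dictionary from accession to Pfam IDs and some way of deciding whether a protein belongs to the class (both hypothetical here).

```python
def domains_exclusive_to(proteins, selected):
    """Return Pfam IDs that occur only in proteins for which selected() is true.

    proteins: dict mapping accession -> set of Pfam IDs
    selected: function taking an accession and returning True or False
    """
    inside, outside = set(), set()
    for accession, pfam_ids in proteins.items():
        # Collect the domains seen in selected and in all other proteins.
        (inside if selected(accession) else outside).update(pfam_ids)
    # A domain is exclusive if it never appears outside the selection.
    return inside - outside
```

The hard part is still the `selected` predicate (for example, deciding what counts as a non-membrane mitochondrial protein), which depends on annotation quality rather than on the mapping itself.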

0

It sounds like you should be able to build a query to answer that using SRS. Or at most a couple of queries!

0

In fact, there is software that lets you easily import, read, parse, filter and cross data with total control over all parameters. With this type of software you can build a pipeline for your needs, or for many other needs, in a few days. And when I say "reinvent the wheel", I don't mean your specific analysis but redesigning a script each time with only minor changes, at a large cost in time :)

0
10.3 years ago
Jerven ▴ 650

On uniprot.org, use the "customize display" option in the UniProt entry view.

Or using a mapping service http://www.uniprot.org/uniprot/?tab=mapping.

If you want to discuss the way UniProt maps to PDBe (not as straightforward as you might think), contact help@uniprot.org. Pfam annotations come directly from the InterPro results, so there should not be much skew between these databases.

0
10.3 years ago
Iain ▴ 260

You could try using the SRS service in the EBI.

http://srs.ebi.ac.uk/

This service links many databases with each other.

There is a tutorial available: http://www.embl.de/~seqanal/courses/srscourse/srstut.html

An example taken directly from this tutorial, the query: enzyme < pdb gives all the enzyme database entries for which the 3D structure is known!

0
10.3 years ago

I am planning to use a hash/checksum of the protein sequences to cross-link Uniprot to others.

SEquence Globally Unique IDentifier (SEGUID) is a hashing standard (based on SHA1) - it was specifically developed for uniquely identifying protein sequences.
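For reference, a minimal Python sketch of such a checksum using only the standard library: it takes the SHA-1 digest of the upper-cased sequence and base64-encodes it without the trailing padding. This follows the SEGUID definition as I understand it; verify against a reference implementation (Biopython has one in `Bio.SeqUtils.CheckSum`) before relying on it for cross-linking.

```python
import base64
import hashlib


def seguid(sequence):
    """Return a SEGUID-style checksum for a protein sequence string."""
    # Upper-case first so that 'acd' and 'ACD' give the same identifier.
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # Base64-encode the 20-byte digest and drop the '=' padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Two entries with the same checksum have identical sequences, so this links records only on exact sequence identity; isoforms, tags and truncated constructs in PDB entries will not match their UniProt parent.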

0

The sequencing cross referencing tool at the EBI might save you some time. http://www.ebi.ac.uk/Tools/picr/

0

Get in touch and let's see if we can merge my approach with yours. I think your idea has real potential.

0
5.5 years ago

This is a very old thread; however, I still have not found a good way to do PDB-to-UniProt residue mapping that doesn't rely on a web server that may not be up to date. SIFTS may be the way to go, but it has a complicated data structure. Below is a simple self-contained Biopython function which relies on an on-the-fly sequence alignment to determine the residue mapping. There may be more elegant ways to script this, but the following works.

# Note: pairwise2 is deprecated in recent Biopython releases in favour of
# Bio.Align.PairwiseAligner, but it is still what this function uses.
from Bio import pairwise2
from Bio.PDB import PPBuilder


def resmap(chain, uniprot_sequence):
    """Return a dictionary mapping PDB residue numbers to UniProt positions (1-based)."""
    # Build the observed PDB sequence and the list of residue numbers from
    # the same polypeptides, so that the two always stay in step.
    pdb_sequence = ""
    pdb_res_nums = []
    for polypeptide in PPBuilder().build_peptides(chain):
        pdb_sequence += str(polypeptide.get_sequence())
        pdb_res_nums.extend(res.id[1] for res in polypeptide)

    # Global alignment: match 2, mismatch -1, gap open -0.5, gap extend -0.1.
    alignments = pairwise2.align.globalms(
        str(uniprot_sequence), pdb_sequence, 2, -1, -.5, -.1)
    uniprot_align = alignments[0][0]
    pdb_align = alignments[0][1]

    mapping = {}
    uniprot_pos = 0   # 1-based position in the UniProt sequence
    pdb_index = -1    # 0-based index into pdb_res_nums
    for uniprot_res, pdb_res in zip(uniprot_align, pdb_align):
        if uniprot_res != "-":
            uniprot_pos += 1
        if pdb_res != "-":
            pdb_index += 1
        # Keep only columns where both sequences have the same residue.
        if uniprot_res == pdb_res and uniprot_res != "-":
            mapping[pdb_res_nums[pdb_index]] = uniprot_pos

    return mapping