Finding PDB File For Specific Transcript
10.1 years ago

I would like to find the PDB file (or whichever 3D structure file format is appropriate for proteins) for a specific transcript, and I ran into several problems, for example when using Ensembl BioMart. When I query its database for, say, the transcript BRCA1-001, I get several proteins associated with it.

I'm a computer scientist, so excuse me if I misunderstand some things. But shouldn't one transcript result in exactly one protein? I guess there can be post-translational changes, but I would have assumed that there would be a way to associate a transcript with its "standard" protein.

So what I'm searching for is, ideally, a web service that gives me the protein ID when given a transcript ID (be it an Ensembl transcript ID or some other type of ID).

Thank you for your insights :)

genomics protein database • 3.8k views
10.1 years ago
Emily 23k

This is to do with the way that BioMart works: it queries the gene database, not the transcript database. So when you put in a transcript ID, it looks for the gene that transcript ID is associated with. It then reports back all the proteins that gene has linked to it. The way to get around this is to always ask for your input as well as your output. If you put in a list of ENST transcript IDs and you want the ENSP IDs, don't just select ENSP IDs as output, select the ENST IDs too. You will pull out far more than you actually put in and have to parse it a bit, but at least you will be able to identify which is which.
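The parsing step described above could look something like the following minimal Python sketch. It assumes the BioMart results were exported as tab-separated text with both the transcript and protein ID columns selected. The column header names and the placeholder IDs in the sample are assumptions for illustration, not values taken from a real export.

```python
import csv
import io


def filter_biomart_rows(tsv_text, wanted_transcripts,
                        transcript_col="Transcript stable ID",
                        protein_col="Protein stable ID"):
    """Keep only rows whose transcript ID was part of the original input.

    Returns {transcript_id: set of protein IDs}. The default column names
    are assumptions; adjust them to match the header of your own export.
    """
    wanted = set(wanted_transcripts)
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        transcript = row[transcript_col]
        protein = row[protein_col]
        # Skip transcripts we did not ask for and rows without a protein ID.
        if transcript in wanted and protein:
            mapping.setdefault(transcript, set()).add(protein)
    return mapping


# Hypothetical export: the ENSP and "OTHER" IDs are placeholders, not real IDs.
SAMPLE_TSV = "\n".join([
    "Transcript stable ID\tProtein stable ID",
    "ENST00000357654\tENSP_PLACEHOLDER_1",
    "ENST_OTHER_1\tENSP_PLACEHOLDER_2",
    "ENST_OTHER_2\t",
])

if __name__ == "__main__":
    print(filter_biomart_rows(SAMPLE_TSV, ["ENST00000357654"]))
```

Keeping the result as a set per transcript means a later surprise (more than one protein for a transcript) shows up in the data rather than being silently overwritten.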

If you're working one at a time, the easiest thing is to avoid BioMart and just search for the transcript ID directly in Ensembl. You'll find a table of the transcripts for that gene, with the one-to-one relationship between the proteins and transcripts shown. You may need to click on "Show transcript table" to see it though.

If you have quite a lot, you can just use our Perl API. You can write a very simple script that pulls out a transcript object by ID, finds the translation object associated with that transcript, and that transcript only, then prints whatever ID type you like to go with it.
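For readers who do not use Perl, a similar transcript-to-translation lookup can be sketched in Python against the Ensembl REST service. This is not the Perl API mentioned above but a different route to the same relationship. The sketch assumes the `lookup/id` endpoint on `rest.ensembl.org` with `expand=1`, and assumes that the JSON returned for a protein-coding transcript contains a `Translation` object with an `id` field; check both against the current REST documentation before relying on them. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def build_lookup_url(transcript_id):
    """Build the lookup URL for one transcript, asking for connected features."""
    return f"{REST_SERVER}/lookup/id/{transcript_id}?expand=1"


def extract_translation_id(record):
    """Return the translation (ENSP) ID from a lookup record, or None.

    Assumes the record has a "Translation" object with an "id" field;
    non-coding transcripts are expected to lack it.
    """
    translation = record.get("Translation")
    return translation["id"] if translation else None


def fetch_translation_id(transcript_id):
    """Query the REST service (needs network access) for one transcript."""
    request = urllib.request.Request(
        build_lookup_url(transcript_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_translation_id(json.load(response))
```

Calling `fetch_translation_id("ENST00000357654")` would then be expected to return the ENSP ID for that transcript only, rather than every protein of the gene.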


Thank you for your answer, but I don't seem to be able to reproduce what you describe. When I search for the protein corresponding to a transcript, everything works fine: http://www.ensembl.org/biomart/martview/8ab732ed264b7109b9576b84b32be173?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id&FILTERS=hsapiens_gene_ensembl.default.filters.ensembl_transcript_id."ENST00000357654"&VISIBLEPANEL=resultspanel

But when I try to add the PDB ID to that, I get more than one result. Again, this might very well be me not understanding some parts of the genetics involved, but from what I understand there should be one PDB ID per transcript, as one protein is translated from one transcript. Here is the request I'm using: http://www.ensembl.org/biomart/martview/8ab732ed264b7109b9576b84b32be173?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.pdb&FILTERS=hsapiens_gene_ensembl.default.filters.ensembl_transcript_id."ENST00000357654"&VISIBLEPANEL=resultspanel

The idea is that I can give the user of my application a 3D model of the protein of the transcript he is currently visualizing. There might be multiple 3D models associated with one transcript, but from what I saw, the same PDB ID also seems to be associated with multiple transcripts. An example of this can be seen here: http://www.ensembl.org/biomart/martview/8ab732ed264b7109b9576b84b32be173?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.pdb|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id&FILTERS=hsapiens_gene_ensembl.default.filters.ensembl_gene_id."ENSG00000012048"&VISIBLEPANEL=resultspanel You quickly notice the same PDB ID on multiple transcripts. Thank you for your help :)


This is a slightly annoying case of xref mapping. Our PDB IDs come in via UniProt. UniProt organises proteins by gene, not by individual transcript. This means that instead of having a direct mapping between an Ensembl transcript and the UniProt protein it encodes, we have an Ensembl gene which links to all the transcripts and all the proteins with no indication of which is which.

UniParc seems to have direct mapping, so this is something you can use. Alternatively, you can narrow down your options by matching up protein lengths between ENSPs and UniProt/PDB IDs.
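The length-matching idea can be sketched as follows. This is only an illustration: the IDs and lengths are invented placeholders, and the function name is hypothetical. Note that equal length narrows the candidates but does not prove that two sequences are identical, and a PDB structure often covers only part of a protein, so any match still needs checking.

```python
def candidates_by_length(ensp_lengths, uniprot_lengths):
    """Map each ENSP ID to the UniProt accessions with the same sequence length.

    Both arguments are {identifier: length in amino acids} dictionaries.
    An empty list means no accession of that length was found.
    """
    # Index the UniProt accessions by length for direct lookup.
    by_length = {}
    for accession, length in uniprot_lengths.items():
        by_length.setdefault(length, []).append(accession)
    return {
        ensp: sorted(by_length.get(length, []))
        for ensp, length in ensp_lengths.items()
    }


# Placeholder data, not real identifiers or lengths.
EXAMPLE_ENSP = {"ENSP_A": 500, "ENSP_B": 320, "ENSP_C": 75}
EXAMPLE_UNIPROT = {"ACC_1": 500, "ACC_2": 320, "ACC_3": 500, "ACC_4": 63}

if __name__ == "__main__":
    print(candidates_by_length(EXAMPLE_ENSP, EXAMPLE_UNIPROT))
```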


Thank you. From your explanation it seems that this connection (transcript -> protein 3D structure) is not possible yet. At least I could not find a way to go from the UniParc ID to a PDB ID, and the solution using protein lengths seems too "hacky". I hope this is something that can be solved in the future; I feel like Ensembl is in the perfect position to provide such a service (transcript -> 3D protein structure).


Hey, I know this is an old question, but have you found a way to map transcripts to PDBs yet?


The mapping is fixed now. But it's worth noting that since PDB will have multiple structures per protein (monomer only, dimer, dimer bound to the ligand, dimer bound to the other ligand, dimer bound to both ligands at once etc etc etc) there are still often many PDB IDs for one transcript/protein.


Thank you for the reply, but I can't seem to find the mapping anywhere on Ensembl's FTP site. Is it accessible? The one-to-many mapping is not a problem for what I am currently dealing with.


There are no xref mapping files on the FTP site. These are stored in the MySQL tables.

/usr/local/mysql/bin/mysql -h ensembldb.ensembl.org -u anonymous -P 5306
use homo_sapiens_core_89_38;
select translation.stable_id, xref.display_label from xref, object_xref, translation, external_db where xref.xref_id=object_xref.xref_id and object_xref.ensembl_id=translation.translation_id and external_db.external_db_id=xref.external_db_id and external_db.db_display_name like "%pdb%";
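Since one translation can carry many PDB IDs, the query result is easiest to use once grouped. The Python sketch below assumes the output of the query above was saved as tab-separated text with the two columns `stable_id` and `display_label` (for instance from the mysql client in batch mode); the function name and the sample rows are placeholders made up for illustration.

```python
from collections import defaultdict


def group_pdb_by_translation(query_output):
    """Group PDB labels by translation stable ID.

    Expects tab-separated lines of the form "stable_id<TAB>display_label".
    The header line and any line without exactly two fields are skipped.
    """
    groups = defaultdict(set)
    for line in query_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 2 or fields[0] == "stable_id":
            continue
        translation_id, pdb_label = fields
        groups[translation_id].add(pdb_label)
    # Sort the labels so the result is stable between runs.
    return {tid: sorted(labels) for tid, labels in groups.items()}


# Placeholder rows, not real query output.
SAMPLE_OUTPUT = "\n".join([
    "stable_id\tdisplay_label",
    "ENSP_PLACEHOLDER_1\tPDB_X",
    "ENSP_PLACEHOLDER_1\tPDB_Y",
    "ENSP_PLACEHOLDER_2\tPDB_X",
    "ENSP_PLACEHOLDER_1\tPDB_X",
])

if __name__ == "__main__":
    print(group_pdb_by_translation(SAMPLE_OUTPUT))
```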

It's surprising to me how hard it can be to translate the 'genome' world to the 'protein' world... The transcript -> PDB mapping provided by BioMart is a good step here. It would be interesting if AlphaFold-type structures could be included as well.
