Mapping Protein IDs To Protein Cluster IDs
12.5 years ago

Is there an ID mapping option (web service or FTP) that I can use to map the protein IDs (PIDs) or accessions of a given set of genomes to their corresponding protein clusters? I tried the UniProt ID mapping interface, which offers conversion of accessions to blastclustDB. Surprisingly, however, the reference genome I am working on does not have UniProt accessions, even two years after its release at NCBI. Since Protein Clusters is an NCBI Entrez service, I assume there should be a link from protein IDs to the protein cluster, but I have not been able to locate it.

Examples of the protein ID types that could be used to assign protein clusters: YP_003251185.1 or GI:261417503

identifiers conversion • 4.3k views

By 'blastclustDB' do you mean Entrez Protein Clusters (ProtClustDB): http://www.ncbi.nlm.nih.gov/proteinclusters

If you do mean Entrez protein clusters - that's an experimental NCBI service, not updated since 2010. I would not recommend using it.

Also, you do not need UniProt accessions to use the UniProt ID mapping service. It accepts multiple kinds of identifier, including GIs.

@Hamish, yes, it is Protein Clusters at NCBI, which the UniProt mapping service refers to as Blastclust.

@neilfws: UniProt ID mapping does not let me choose GI to BlastclustDB conversion; if I select it, the target automatically changes to UniProtKB AC/ID.

@Robert Checking at UniProt I cannot find any mention of "blastclustdb", however looking for ProtClustDB finds the dbxref entry along with the News announcement detailing the addition of ProtClustDB (http://www.uniprot.org/news/2010/03/02/release) and the Identifier mapping service documentation detailing the names for use with the ID Mapping web service (http://www.uniprot.org/faq/28#id_mapping_examples). So in the interests of tracking this reference down so UniProt can correct it, where are you seeing "blastclustDB"?

ProtClustDB only contains clustering data for selected RefSeq proteins, so it is entirely possible that your proteins are not present in the database. Please edit your question to provide sample protein_ids and the identifier(s) for the reference genome so we can verify that is the case and suggest an appropriate tactic to map the proteins.

@Hamish Sorry, that was a typo; it is indeed ProtClustDB. I am adding example protein IDs to the original question as an edit. Thanks.

12.5 years ago
Hamish ★ 3.2k

Looking at your example YP_003251185.1, it is a provisional RefSeq and is in effect a direct clone of ACX76703 from the INSDC databases (DDBJ, EMBL-Bank & GenBank). Since UniProtKB uses EMBL-Bank as a primary data source (in the form of UniProtKB/TrEMBL; TrEMBL => translated EMBL-Bank), using the INSDC 'protein_id' when searching UniProtKB is more robust when there are possible synchronization issues. In this case a search in UniProtKB with ACX76703 finds C9RXR9. Checking the cross-references, this entry has the expected cross-reference to RefSeq YP_003251185, and does not contain a cross-reference to ProtClustDB.

Going back and looking at the RefSeq entry, it gives me the NCBI Taxonomy ID of the source organism: Geobacillus sp. Y412MC61, taxon:544556. UniProt uses the same identifiers in the UniProt Taxonomy (NEWT); the main difference is that UniProt sometimes chooses a different authority and thus a different species name. Finding the organism in UniProt is therefore a search for the taxonomy ID in their Taxonomy, which gives NCBI_TaxID=544556. As expected, the nomenclature used is slightly different: Geobacillus sp. (strain Y412MC61). The UniProt Taxonomy entry also tells us that UniProtKB contains a complete proteome for this organism.

Since the protein sequences are available in UniProtKB, they will be clustered as part of the UniProt Reference Clusters (UniRef) databases.

Checking the "Related information" section of the right-hand side-bar for the RefSeq entry, there is a link to "Protein Clusters", which gives the corresponding entry in ProtClustDB: CLSK712430. Checking the E-utilities documentation, Protein Clusters is available for searches. So to map from the RefSeq entry you can use ELink to get the identifiers (UIDs) of the related entries. For example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=proteinclusters&id=261417503

This gives the UID of the entry in ProtClustDB. Unfortunately since EFetch does not support ProtClustDB it is not possible to fetch the actual data, but ELink can be used again to get the UIDs of the member proteins of the cluster:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=proteinclusters&db=protein&id=712430
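The two ELink calls above can be put together in a short script. What follows is only a minimal Python sketch: the helper names `elink_url` and `linked_uids` are made up for illustration, and `SAMPLE` is a hand-written snippet in the shape ELink is assumed to return (UIDs under `LinkSetDb/Link/Id`), not captured server output. Check a real response before relying on the parsing.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def elink_url(dbfrom, db, uid):
    """Build an ELink URL linking one UID in `dbfrom` to records in `db`."""
    return ELINK + "?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})

def linked_uids(xml_text):
    """Return the UIDs found under LinkSetDb/Link/Id in an ELink XML reply."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//LinkSetDb/Link/Id")]

# Hand-written sample (assumed response shape), NOT real server output.
SAMPLE = """<eLinkResult><LinkSet>
  <DbFrom>protein</DbFrom>
  <IdList><Id>261417503</Id></IdList>
  <LinkSetDb><DbTo>proteinclusters</DbTo>
    <Link><Id>712430</Id></Link>
  </LinkSetDb>
</LinkSet></eLinkResult>"""

# Protein GI -> cluster UID, then cluster UID -> member protein UIDs.
forward = elink_url("protein", "proteinclusters", "261417503")
reverse = elink_url("proteinclusters", "protein", "712430")
cluster_uids = linked_uids(SAMPLE)
```

To run it against the live service, fetch `forward` with `urllib.request.urlopen` and pass the decoded body to `linked_uids`; the same parser should work for the reply to `reverse`, assuming the response shape is as sketched.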

Alternatively, you can have a look at the ProtClustDB data on the NCBI's FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/). This contains information about each cluster, including the nicer cluster identifiers used on the web interface.

However, as Neil has mentioned, this data has not been updated since 2010 and it was an experimental project looking into clustering methods. This is the likely reason why the UniProtKB entries are missing the expected ProtClustDB cross-reference, since in most cases UniProt depends on the database maintainer providing the cross-reference data to be included in the entries.

@Robert See the E-utilities documentation (linked above) for details of the various limitations. In your case it sounds like your query is too large. Try splitting it into smaller chunks and submitting multiple queries. If you still have problems, check the documentation for the module and submit a new question.
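As a rough illustration of the chunking idea, here is a minimal Python sketch rather than a tested client. It assumes a batch size of 200 (the E-utilities documentation suggests HTTP POST above roughly that many UIDs), sends the parameters in a POST body so the URL stays short, and repeats the `id` parameter once per GI, which, as far as I understand ELink, keeps the links separate for each input ID. The helper names and the GI list are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def elink_requests(gis, batch_size=200):
    """Build one POST request per batch of GIs; nothing is sent here."""
    for batch in chunks(gis, batch_size):
        # doseq=True emits a separate id=... pair for every GI in the batch.
        body = urlencode(
            {"dbfrom": "protein", "db": "proteinclusters", "id": batch},
            doseq=True,
        ).encode("ascii")
        # Supplying `data` makes urllib issue a POST instead of a GET.
        yield Request(ELINK, data=body)

# Made-up placeholder GIs, for illustration only.
gis = [str(261417503 + i) for i in range(3000)]
pending = list(elink_requests(gis))
```

Each request in `pending` can then be sent with `urllib.request.urlopen`, with a short pause between calls; NCBI asks clients without an API key to stay at or below three requests per second.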

UniProtKB to UniRef mapping serves my purpose, as I needed precalculated protein cluster IDs. Cheers!

I also tried E-utilities, posting the GIs separated by commas, but got error 414 (URL too large). Is there any way to do a batch search of more than 3000 GIs?

Hamish: The UniProt to UniRef ID mapping option does the job of getting the protein clusters, with the limitation that clusters are only available at 50%, 90% and 100% identity. I am also trying NCBI E-utilities, using the "Bio::DB::EUtilities" module, to map over 3000 protein IDs, but I get a "URI too large" error. Please advise whether I am doing something wrong or whether there is a limit on the size of queries sent to E-utilities. If you feel I should post this as a new thread, please let me know. Thanks.
