How to fetch a protein's UniProt ID by protein sequence string
10.0 years ago
alsu.wh • 0

I have been looking for a BioJava class that allows searching for protein information, such as its ID, organism name, etc., given the protein sequence (e.g. HELHYNILLCGNLCLPLQDFRAQIIKYVFMHSRKDINWMN). The class UniprotProxySequenceReader does the opposite (you need to enter the UniProt ID). Is there any way to search by protein sequence?

sequence sequencing • 3.1k views

Are you asking whether sequence similarity search tools exist?

No, I am asking how to find the organism name.

Thanks

After a sequence similarity search, can't you get the organism name straight from the hit header, or by mapping the hit ID to an organism name?

So, how can I get the organism name(s) for the protein sequence string?

Basically, you either parse it straight from your results (i.e. if the species name is in the reference database headers), or you map it from the identifiers in your reference database headers.
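
As a rough illustration of the first option, here is a minimal Java sketch that pulls the organism name out of a FASTA header. It assumes UniProtKB-style headers, where the organism follows `OS=` and is terminated by the next two-letter `XX=` field or the end of the line; other reference databases use different header layouts. The class name `FastaHeaderOrganism` and the example header (accession `X00000`, entry `EXMP_HUMAN`) are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FastaHeaderOrganism {
    // Assumption: UniProtKB-style header, e.g. "... OS=Organism name OX=... GN=... PE=... SV=...".
    // The organism name runs from "OS=" up to the next two-letter "XX=" field or end of line.
    private static final Pattern ORGANISM =
            Pattern.compile("\\bOS=(.+?)(?=\\s+[A-Z]{2}=|$)");

    /** Returns the organism name from a FASTA header line, if an OS= field is present. */
    static Optional<String> organism(String headerLine) {
        Matcher m = ORGANISM.matcher(headerLine);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical header, not a real database entry.
        String header = ">sp|X00000|EXMP_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EXMP PE=1 SV=1";
        System.out.println(organism(header).orElse("(no organism in header)"));
    }
}
```

If the header carries no organism field, the method returns an empty `Optional`, which is the cue to fall back to mapping the hit identifier to an organism instead.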

10.0 years ago
Hamish ★ 3.2k

To look up a protein sequence and find the database identifiers associated with it (i.e. database entries having the same sequence), you can use:

  • PICR (Protein Identifier Cross-Reference): this service provides options that perform look-up of protein entry identifiers, look-up of protein sequence, and sequence similarity search (i.e. NCBI BLAST).

Alternatively, you could use a tool such as EMBL-EBI's Sequence checksum calculator to derive a checksum for the sequence, and then use this to search the protein databases. For UniProtKB, this means taking the CRC64-ISO value produced by the "Sequence checksum calculator" (e.g. 1C2E8ADA9FD97949 for the UniProtKB:WAP_RAT sequence) and using it to search the database (e.g. with EBI Search) to find the corresponding entries.
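
If you would rather compute the checksum locally in Java, a minimal sketch could look like the one below. It assumes the CRC64 variant commonly published for SWISS-PROT/UniProtKB (reflected polynomial `0xD800000000000000`, initial value 0, no final XOR, output as 16 uppercase hex digits). The class name `SequenceCrc64` is made up for illustration. Before relying on it, check that it reproduces the calculator's value for a known entry, such as the WAP_RAT value quoted above.

```java
final class SequenceCrc64 {
    // Assumption: reflected polynomial of the SWISS-PROT style CRC64 (see the note above).
    private static final long POLY64_REV = 0xD800000000000000L;
    private static final long[] TABLE = new long[256];

    static {
        // Precompute the contribution of every possible input byte.
        for (int i = 0; i < 256; i++) {
            long part = i;
            for (int bit = 0; bit < 8; bit++) {
                part = ((part & 1L) != 0) ? (part >>> 1) ^ POLY64_REV : (part >>> 1);
            }
            TABLE[i] = part;
        }
    }

    /** Returns the checksum of the sequence as 16 uppercase hexadecimal digits. */
    static String crc64(String sequence) {
        long crc = 0L; // assumption: initial value 0, no final XOR
        for (int i = 0; i < sequence.length(); i++) {
            int index = (int) ((crc ^ sequence.charAt(i)) & 0xFF);
            crc = TABLE[index] ^ (crc >>> 8);
        }
        return String.format("%016X", crc);
    }

    public static void main(String[] args) {
        // Sequence taken from the question; pass the residues only, without a FASTA header.
        System.out.println(crc64("HELHYNILLCGNLCLPLQDFRAQIIKYVFMHSRKDINWMN"));
    }
}
```

The checksum is computed over the characters exactly as given, so strip whitespace and line breaks first and use the same letter case as the database (uppercase for protein sequences).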

Note: the "Sequence checksum calculator" and search option is also applicable to nucleotide sequences. For example, for the EMBL-Bank:L12345 sequence:

  • "Sequence checksum calculator" gives the MD5 value: 16048c86c30b164927c7a402bbdbcb35
  • Using EBI Search to search with the MD5 value gives EMBL-Bank L12345 as the search result.
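
The MD5 value can also be computed locally with the standard `java.security.MessageDigest` class. The sketch below is only an illustration: the class name `SequenceMd5` is made up, and the normalisation it applies (removing whitespace and upper-casing) is an assumption about what the "Sequence checksum calculator" hashes, so compare its output against the calculator for a known entry before using it for searches.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class SequenceMd5 {
    /** Returns the MD5 digest of the normalised sequence as 32 lowercase hex digits. */
    static String md5Hex(String sequence) {
        // Assumption: the checksum covers residues only, in uppercase, with no whitespace.
        String normalised = sequence.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalised.getBytes(StandardCharsets.US_ASCII));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative input, not the EMBL-Bank:L12345 sequence.
        System.out.println(md5Hex("acgt acgt\nacgt"));
    }
}
```

The resulting hex string is what you would paste into the search box, in the same way as the value listed above.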

If you want to find similar sequences (i.e. not the exact same sequence), you will need to use sequence similarity searches.

All of the services detailed above are available via Web Services, which can be accessed from Java.
