Question

[UniProt] The Protein Has Complete Sequence: Is It Always The Case?

4

13.6 years ago

Pals ★ 1.3k

I have carried out modeling and docking studies on a protein. According to its entry in UniProt, the sequence of the protein is complete. However, I suspect that about 20-30 residues are missing from its N-terminal region, because the template, as well as other structures from the same family, have the corresponding region, and that portion also seems to be catalytically important (in other structures of this family, these residues play an important role in holding the ligand in the right orientation). To verify this, I ran a BLAST search against the nr database and found no sequences matching the N-terminal-most region of my protein. For example, all of the homologous proteins start from residues 20-35. In this case, can I propose that the protein sequence is incomplete?
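To make this comparison concrete, here is a minimal Python sketch that reads BLAST tabular output (`-outfmt 6`, standard 12 columns) and reports, for each hit, how far the alignment start in the homolog is shifted relative to the alignment start in the query. A positive value suggests the homolog has extra residues upstream of the query, a negative value the opposite. The function name `nterm_offsets`, the `max_start` cutoff, and the three hit lines are all made up for illustration; they are not my real results.

```python
from statistics import median

def nterm_offsets(blast_tab_text, max_start=5):
    """Return (subject_id, sstart - qstart) for each hit whose alignment
    begins near the N-terminus of the query or of the subject.

    Expects BLAST -outfmt 6 lines with the standard column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    offsets = []
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        sseqid = fields[1]
        qstart, sstart = int(fields[6]), int(fields[8])
        # Ignore purely internal alignments (e.g. a shared domain mid-protein).
        if min(qstart, sstart) <= max_start:
            offsets.append((sseqid, sstart - qstart))
    return offsets

# Hypothetical hits, for illustration only.
example = "\n".join([
    "query\thomologA\t82.0\t300\t54\t0\t1\t300\t28\t327\t1e-150\t520",
    "query\thomologB\t65.0\t295\t100\t2\t2\t296\t24\t318\t1e-110\t400",
    "query\thomologC\t58.0\t250\t105\t0\t40\t289\t60\t309\t1e-80\t300",
])

offsets = nterm_offsets(example)
typical_shift = median(shift for _, shift in offsets)
print(offsets)
print(typical_shift)
```

A consistent positive shift across many independent homologs would support the idea that the query is truncated at the N-terminus, but it is only indirect evidence.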

docking modeling uniprot • 3.4k views

ADD COMMENT • link updated 8.4 years ago by Biostar 20 • written 13.6 years ago by Pals ★ 1.3k

2

Does your sequence come from SwissProt or TrEMBL?

ADD REPLY • link 13.6 years ago by Michael Schubert ★ 7.1k

1

It came from TrEMBL (Q56917)

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k

1

The background of Michael's question is almost certainly that UniProt contains both well-curated proteins (from SwissProt and PIR) and automatically translated nucleotide sequences (from TrEMBL). The latter are much more likely to contain errors.

ADD REPLY • link 13.6 years ago by Chris Evelo 10k

0

It came from SwissProt.

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k

0

Yes, sorry, forgot to follow up on this.

ADD REPLY • link 13.6 years ago by Michael Schubert ★ 7.1k

score 3 · Answer 1 · 2011-04-07

3

13.6 years ago

Larry_Parnell 16k

You certainly can! When I was analyzing gene models for the Arabidopsis thaliana and human genome projects, this is precisely the kind of result that indicated an error in the gene model and hence in the conceptual translation into protein. In fact, most such errors were found at the N terminus, just as in your example. Without genomic sequence in hand, it may be difficult for you to model your protein, because you need to find a new exon 1 (maybe more, because this gene model is very likely to extend farther in the 5' or upstream direction). If that genomic sequence does not exist, one approach is to "borrow" the missing residues from the top BLASTP hit as a surrogate for the N terminus, for the purpose of modeling.
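The "borrowing" idea could be sketched roughly as follows in Python. This is only an illustration under simplifying assumptions: the alignment near the N terminus is taken to be ungapped, `qstart` and `sstart` are the 1-based alignment start positions reported by BLASTP, and the function name `borrow_n_terminus` and both toy sequences are hypothetical.

```python
def borrow_n_terminus(query_seq, hit_seq, qstart, sstart):
    """Prepend to the query the hit residues that lie upstream of it.

    qstart and sstart are the 1-based positions where the BLASTP alignment
    begins in the query and in the hit. Assumes no gaps near the N terminus,
    so the hit extends (sstart - qstart) residues beyond the query start.
    Returns (borrowed_residues, extended_sequence).
    """
    extra = sstart - qstart
    if extra <= 0:
        return "", query_seq  # the hit has nothing upstream to lend
    borrowed = hit_seq[:extra]
    return borrowed, borrowed + query_seq

# Toy sequences, for illustration only.
hit = "MKTLLAAGHS" + "DEFGHIKLMN"
query = "DEFGHVKLMN"

borrowed, extended = borrow_n_terminus(query, hit, qstart=1, sstart=11)
print(borrowed)
print(extended)
```

The extended sequence is a surrogate for modeling only. The borrowed residues come from a different protein, so they should be tracked separately and not reported as part of the real sequence.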

Please let me know if I should provide more details for finding the missing ~20-35 residues.

ADD COMMENT • link 13.6 years ago by Larry_Parnell 16k

1

Do you have access to any genome sequence data? This could be your own data, ESTs, or a genome project on the organism you're studying. If so, take the N-terminus of the protein matching at 82% and use it as the query in a search against those DNA sequences.
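A protein query against translated DNA is what `tblastn` does, so one way to set this up is sketched below in Python. It assumes NCBI BLAST+ is installed and that a nucleotide database has already been built with `makeblastdb -dbtype nucl`. The helper names, the file name `top_hit_Nterm.fasta`, the database name `my_dna_db`, and the placeholder sequence are all hypothetical. The sketch only prepares the FASTA record and the command line; it does not run BLAST.

```python
import shlex

def nterm_fasta(name, seq, n_residues=60):
    """Return a FASTA record holding only the first n_residues of seq."""
    return f">{name}_Nterm\n{seq[:n_residues]}\n"

def tblastn_command(query_fasta, db_name, evalue=1e-5):
    """Build a tblastn argument list (protein query vs. translated DNA).

    db_name must be a nucleotide BLAST database, e.g. one created with:
        makeblastdb -in reads_or_contigs.fasta -dbtype nucl -out my_dna_db
    """
    return ["tblastn", "-query", query_fasta, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6"]

# Placeholder sequence standing in for the 82% hit; not a real protein.
placeholder_hit = "MKTLLAAGHSDEFGHIKLMN" * 5
record = nterm_fasta("top_hit", placeholder_hit, n_residues=40)
cmd = tblastn_command("top_hit_Nterm.fasta", "my_dna_db")

print(record)
print(shlex.join(cmd))
```

After writing `record` to `top_hit_Nterm.fasta`, the printed command can be run in a shell, or passed to `subprocess.run(cmd)`. Hits that lie just upstream of the annotated start of your gene would be the candidates for the missing residues.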

ADD REPLY • link 13.6 years ago by Larry_Parnell 16k

0

Thanks Larry! The top BLASTP hit after the protein itself is another protein with 82% sequence identity. In that case, how can we be sure that the residues at its N-terminus are the ones that should be present in our protein? Of course, it could provide the secondary structure needed for docking studies, and I am sure the catalytically significant residues are present in that protein too. But if I cannot get the exact residues, I don't think it would make much sense.

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k

0

This protein is from Yersinia enterocolitica. However, its genome has not been sequenced yet (sequencing is probably under way).

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k

0

I searched for that sequence but did not get satisfying results, because the strain Ye O:3 has not been sequenced yet. However, I did try Ensembl.

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k

score 1 · Answer 2 · 2011-04-07

1

13.6 years ago

Jerven ▴ 660

Yes, the real protein as actually expressed does not always match the sequence reported in UniProtKB. This is especially true for sequences in the TrEMBL section, as the protein sequence prediction can be very bad. This means care needs to be taken, as you are doing now.

However, you can always request an update of the sequence, and that it be integrated into Swiss-Prot, using the contact link on uniprot.org (top right in the blue bar). There is no guarantee that curator time is available, but it never hurts to ask.

ADD COMMENT • link 13.6 years ago by Jerven ▴ 660

0

Yes, I have asked for an update. But it's very unlikely that they will add the missing residues. Instead, they might mark it as an incomplete sequence. :)

ADD REPLY • link 13.6 years ago by Pals ★ 1.3k