Question

Help Regarding Redundant Entries of UniProtKB/TrEMBL

3

13.9 years ago

Ananth ▴ 30

I am using UniProtKB to download protein sequences of the Argonaute superfamily (query: Argonaute OR Piwi). The search returns 194 UniProtKB/Swiss-Prot and 888 UniProtKB/TrEMBL entries.

On further analysis of these hits, I find that the UniProtKB/TrEMBL entries are redundant, whereas UniProtKB/Swiss-Prot gives one record per gene per species.

I am in a dilemma as to which UniProtKB/TrEMBL entries to use for a particular protein from a given species, since there are multiple entries per gene for the same species, each with a different accession number.

For example, the protein Seawi from Strongylocentrotus purpuratus has only one gene, but UniProtKB/TrEMBL lists 4 accessions (Q9GPA7, Q9GPA8, Q9GPA6, C9EID6) with varying sequence lengths.

There are a large number of sequences that I would miss if I used only UniProtKB/Swiss-Prot sequences.

Kindly help me on this...

uniprot • 3.8k views

updated 13.0 years ago by Fidel ★ 2.0k • written 13.9 years ago by Ananth ▴ 30

score 10 · Answer 1 · 2010-08-19

The issue here is that you are dealing with two separate databases, each of which is designed for a different purpose.

UniProtKB/TrEMBL was designed to deal with high-throughput data (e.g. from genome sequencing), by applying automated analyses. Nucleotide sequences from EMBL-Bank/GenBank/DDBJ, annotated as coding, are translated and annotated "automatically", using a computational pipeline. These protein sequences may therefore contain errors and are frequently not full-length. TrEMBL is non-redundant in the sense that identical, full-length sequences from the same organism are represented by a single record, but there may be many records for fragments, isoforms etc., derived from the same protein.

Entries in UniProtKB/Swiss-Prot are curated and reviewed manually. They are non-redundant in the sense that each record represents one "gene". Fragments, isoforms etc. can then be derived from the feature table.

How you use the data depends on precisely what you want to do. You might think of TrEMBL data as less "reliable", so you're not necessarily "missing out" by not using it.

score 5 · Answer 2 · 2010-08-19

The problems that you mention are the very reason why we do not use UniProt as the source of sequences for STRING and related databases: UniProtKB/Swiss-Prot does not contain sequences for all genes, and UniProtKB/TrEMBL often contains multiple entries for a single gene, with no easy way to construct a unique set.

What I do instead is rely on genome-centric databases (such as Ensembl and RefSeq genomes), in which it is explicit which proteins are encoded by the same locus. All you then have to do is decide which of the splice isoforms to use; one option is to take the longest isoform, so that a single isoform covers as much of the coding potential of each gene as possible.
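To make the "longest isoform per locus" idea concrete, here is a minimal sketch in Python. It assumes a protein FASTA file whose headers carry a `gene:<id>` field, as Ensembl peptide FASTA files do; the function names and the toy records are made up for illustration, and ties are resolved by keeping the first record seen.

```python
import re
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(handle):
    """Return {gene_id: (protein_id, sequence)}, keeping the longest protein per gene."""
    best = {}
    for header, seq in read_fasta(handle):
        # Assumption: the header contains a "gene:<id>" field (Ensembl style).
        match = re.search(r"\bgene:(\S+)", header)
        if match is None:
            continue  # no locus information, so the record cannot be assigned
        gene_id = match.group(1)
        protein_id = header.split()[0]
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    return best


# Toy input: two isoforms of gene G1 and one protein of gene G2 (invented data).
example = StringIO(
    ">P1 pep gene:G1 transcript:T1\nMKV\n"
    ">P2 pep gene:G1 transcript:T2\nMKVLLA\n"
    ">P3 pep gene:G2 transcript:T3\nMSTQ\n"
)
for gene_id, (protein_id, seq) in longest_isoform_per_gene(example).items():
    print(gene_id, protein_id, len(seq))
```

For a real file you would pass an open file handle instead of the `StringIO` object. Records without a `gene:` field are skipped silently here, which you may prefer to report.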

If you are specifically interested in the sea urchin, I would recommend that you base your analysis on the data from the Sea Urchin Genome Project.

score 2 · Answer 3 · 2010-08-19

2

13.9 years ago

Michael Kuhn 5.0k

If you expect that there's only one copy per species (or if it doesn't matter that you miss a second copy), just take the longest protein from each species. As Lars pointed out, even if you use a complete genome, you will still have multiple splice forms and need to decide which of them to use.
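As a rough sketch of "take the longest protein from each species", the Python below works on FASTA text downloaded from UniProtKB. It assumes UniProt-style headers (`>db|ACCESSION|ENTRY_NAME ... OS=Organism name ...`); the `TOY...` accessions and sequences are invented purely for illustration, and the function names are my own.

```python
import re

# The organism name sits between "OS=" and the next two-letter field
# such as " GN=" or " PE=" (or the end of the header line).
OS_FIELD = re.compile(r"\bOS=(.+?)(?=\s[A-Z]{2}=|$)")


def parse_uniprot_fasta(text):
    """Yield (accession, organism, sequence) from UniProt-style FASTA text."""
    for record in ("\n" + text).split("\n>")[1:]:
        lines = record.strip().splitlines()
        header = lines[0]
        seq = "".join(part.strip() for part in lines[1:])
        accession = header.split("|")[1]  # assumes the db|ACC|NAME layout
        match = OS_FIELD.search(header)
        organism = match.group(1) if match else "unknown"
        yield accession, organism, seq


def longest_per_species(text):
    """Return {organism: (accession, sequence)} with the longest sequence per organism."""
    best = {}
    for accession, organism, seq in parse_uniprot_fasta(text):
        if organism not in best or len(seq) > len(best[organism][1]):
            best[organism] = (accession, seq)
    return best


# Invented toy records: a full-length entry and a fragment from one species,
# plus one entry from a second species.
TOY_FASTA = """\
>tr|TOY001|TOY001_STRPU Toy protein OS=Strongylocentrotus purpuratus PE=4 SV=1
MKVLA
>tr|TOY002|TOY002_STRPU Toy protein (Fragment) OS=Strongylocentrotus purpuratus PE=4 SV=1
MKV
>sp|TOY003|TOY3_HUMAN Toy protein OS=Homo sapiens PE=1 SV=1
MSTQLLPA
"""

for organism, (accession, seq) in longest_per_species(TOY_FASTA).items():
    print(organism, accession, len(seq))
```

Keep in mind the caveat above: this collapses everything to one protein per species, so a genuine second copy (paralog) is lost along with the fragments.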

0

Thank you very much for the reply. It was of great help.

13.9 years ago by Ananth ▴ 30

score 1 · Answer 4 · 2011-06-29

A section of the UniProtKB FAQ states:

"UniProtKB/TrEMBL is 'non-redundant' in the sense that all identical, full-length protein sequences, provided they come from the same species, are represented in a single record. UniProtKB/TrEMBL sequences are translations of CDS submitted to the EMBL-Bank/GenBank/DDBJ databases and cross-references to the original submissions are kept in the entries. Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

But they also say:

"Identical protein sequence which are': Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

One way to avoid this redundancy is to use UniRef100, which combines identical sequences and subfragments into a single entry and provides mappings back to the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL accessions.
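To illustrate what "combining identical sequences and subfragments" means in practice, here is a simplified Python sketch. It is not UniRef's actual procedure, just the same idea applied to a small set you have already downloaded: each sequence is assigned to the first longer (or identical) sequence that contains it. The function name and the toy sequences are assumptions for the example.

```python
def collapse_redundant(records):
    """Group sequences that are identical to, or contained in, a longer one.

    records: dict mapping accession -> sequence.
    Returns a dict mapping each representative accession to the list of
    accessions it absorbs (the representative itself comes first).
    """
    # Longest first, so a containing sequence is always seen before its fragments.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = {}
    for accession, seq in ordered:
        for representative in clusters:
            if seq in records[representative]:
                clusters[representative].append(accession)
                break
        else:
            # Not contained in any existing representative: start a new cluster.
            clusters[accession] = [accession]
    return clusters


# Invented toy data: A and C are identical, B is a fragment of A,
# and D differs from A at one internal position.
toy = {"A": "MKVLAGT", "B": "KVLA", "C": "MKVLAGT", "D": "MKVQAGT"}
for representative, members in collapse_redundant(toy).items():
    print(representative, members)
```

Note that D stays in its own cluster: sequences that differ internally, such as isoforms and variants, are not merged by this kind of collapsing, so you still have to choose among them. The pairwise comparison is quadratic, which is fine for a few hundred sequences but not for a whole database.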