Help regarding redundant entries of UniProtKB/TrEMBL
13.7 years ago
Ananth ▴ 30

I am using UniProtKB to download protein sequences of the Argonaute superfamily (query: Argonaute OR Piwi). The hits comprise 194 UniProtKB/Swiss-Prot and 888 UniProtKB/TrEMBL entries.

On further analysis, I find that the UniProtKB/TrEMBL entries are redundant, whereas UniProtKB/Swiss-Prot gives one record per gene per species.

I am unsure which UniProtKB/TrEMBL entries to use for a given protein from a given species, since there are multiple entries per gene for the same species, each with a different accession number.

For example, the protein Seawi from Strongylocentrotus purpuratus has only one gene, but UniProtKB/TrEMBL lists four accessions (Q9GPA7, Q9GPA8, Q9GPA6, C9EID6) with varying sequence lengths.

If I use only UniProtKB/Swiss-Prot sequences, I will miss a large number of sequences.

Any help would be appreciated.

uniprot • 3.7k views
13.7 years ago
Neilfws 49k

The issue here is that you are dealing with two separate databases, each of which is designed for a different purpose.

UniProtKB/TrEMBL was designed to deal with high-throughput data (e.g. from genome sequencing), by applying automated analyses. Nucleotide sequences from EMBL-Bank/GenBank/DDBJ, annotated as coding, are translated and annotated "automatically", using a computational pipeline. These protein sequences may therefore contain errors and are frequently not full-length. TrEMBL is non-redundant in the sense that identical, full-length sequences from the same organism are represented by a single record, but there may be many records for fragments, isoforms etc., derived from the same protein.

Entries in UniProtKB/Swiss-Prot are curated and reviewed manually. They are non-redundant in the sense that each record represents one "gene". Fragments, isoforms etc. can then be derived from the feature table.

How you use the data depends on precisely what you want to do. You might think of TrEMBL data as less "reliable", so you're not necessarily "missing out" by not using it.
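To keep the two sets apart after a combined download, one option is to look at the prefix of each FASTA header: UniProtKB FASTA headers start with `sp|` for Swiss-Prot (reviewed) entries and `tr|` for TrEMBL (unreviewed) entries. The following is only a minimal sketch; the accessions, entry names and descriptions in the example headers are invented for illustration.

```python
from collections import Counter


def entry_status(header: str) -> str:
    """Classify a UniProtKB FASTA header as Swiss-Prot, TrEMBL or unknown."""
    # The database tag is the text before the first '|', e.g. 'sp' or 'tr'.
    prefix = header.lstrip(">").split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(prefix, "unknown")


# Hypothetical headers (accessions and descriptions are made up).
headers = [
    ">sp|EX0001|AGO_TEST Protein argonaute OS=Example species OX=1 PE=1 SV=1",
    ">tr|EX0002|EX0002_TEST Piwi-like protein OS=Example species OX=1 PE=2 SV=1",
    ">tr|EX0003|EX0003_TEST Piwi-like protein (Fragment) OS=Example species OX=1 PE=2 SV=1",
]

counts = Counter(entry_status(h) for h in headers)
print(counts)
```

With the counts in hand, you can decide whether to analyse the reviewed entries alone or to add the unreviewed ones with extra filtering.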



Thank you very much for the reply. It was of great help.


Glad to hear it. Feel free to vote for the answer then :-)

13.7 years ago

The problems that you mention are the very reason why we do not use UniProt as the source of sequences for STRING and related databases: UniProtKB/Swiss-Prot does not contain sequences for all genes, and UniProtKB/TrEMBL often contains multiple entries for a single gene, with no easy way to construct a unique set.

Instead, I rely on genome-centric databases (such as Ensembl and RefSeq genomes), in which it is explicit which proteins are encoded by the same locus. All you then have to do is decide which of the splice isoforms to use; one option is to take the longest isoform, so that a single sequence covers as much of the coding potential of each gene as possible.
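The longest-isoform choice can be sketched in a few lines of Python. This assumes you have already extracted (gene ID, protein ID, sequence) triples from your genome-centric source; how the gene ID is obtained depends on that source's file format, and the identifiers below are hypothetical.

```python
def longest_isoform_per_gene(records):
    """Keep the longest protein sequence for each gene.

    records: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id -> (protein_id, sequence).
    On a length tie, the first record seen for the gene is kept.
    """
    best = {}
    for gene_id, protein_id, seq in records:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    return best


# Hypothetical input: two isoforms of geneA, one protein for geneB.
records = [
    ("geneA", "geneA.p1", "MKTAYIAK"),
    ("geneA", "geneA.p2", "MKTAYIAKQRQISFVK"),
    ("geneB", "geneB.p1", "MSEQ"),
]

chosen = longest_isoform_per_gene(records)
for gene_id, (protein_id, seq) in chosen.items():
    print(gene_id, protein_id, len(seq))
```

Note that the longest isoform is a convenient convention, not necessarily the biologically dominant one.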

If you are specifically interested in the sea urchin, I would recommend that you base your analysis on the data from the Sea Urchin Genome Project.


Thank you very much for the reply. It was of great help.

13.7 years ago

If you expect that there's only one copy per species (or if it doesn't matter that you miss a second copy), just take the longest protein from each species. As Lars pointed out, even if you use a complete genome, you will still have multiple splice forms and need to decide which of them to use.
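As a rough sketch of the "longest protein per species" approach, the snippet below reads UniProtKB-style FASTA text, takes the species from the `OS=` field of each header, and keeps the longest sequence seen for each species. The example entries are invented, and the regular expression assumes that `OS=` is followed by one of the `OX=`, `GN=`, `PE=` or `SV=` fields or by the end of the header.

```python
import re

# Organism name: text after 'OS=' up to the next known field or end of line.
OS_FIELD = re.compile(r"OS=(.+?)(?= (?:OX|GN|PE|SV)=|$)")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_species(text):
    """Return {species: (accession, sequence)} with the longest entry each."""
    best = {}
    for header, seq in parse_fasta(text):
        match = OS_FIELD.search(header)
        if match is None:
            continue  # skip headers without an organism field
        species = match.group(1)
        # Accession is the second '|'-separated field, e.g. tr|ACC|NAME.
        accession = header.split("|")[1] if "|" in header else header.split()[0]
        if species not in best or len(seq) > len(best[species][1]):
            best[species] = (accession, seq)
    return best


# Hypothetical entries for illustration only.
fasta = """>tr|EX0001|EX0001_TEST Piwi-like protein OS=Example species one OX=1 PE=2 SV=1
MKTAYIAKQR
QISFVK
>tr|EX0002|EX0002_TEST Piwi-like protein (Fragment) OS=Example species one OX=1 PE=2 SV=1
MKTAYI
>tr|EX0003|EX0003_TEST Argonaute OS=Example species two OX=2 GN=ago PE=3 SV=1
MSEQWENCE
"""

best = longest_per_species(fasta)
for species, (accession, seq) in best.items():
    print(species, accession, len(seq))
```

This discards paralogues by design, so it only fits the case described above where a single copy per species is expected or sufficient.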


Thank you very much for the reply. It was of great help.

12.8 years ago
Fidel ★ 2.0k

A section of the UniProtKB FAQ states:

"UniProtKB/TrEMBL is 'non-redundant' in the sense that all identical, full-length protein sequences, provided they come from the same species, are represented in a single record. UniProtKB/TrEMBL sequences are translations of CDS submitted to the EMBL-Bank/GenBank/DDBJ databases and cross-references to the original submissions are kept in the entries. Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

But they also say:

"Identical protein sequences which are: Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

A solution to avoid redundancy is to use UniRef100 which combines identical sequences and subfragments and has mappings to UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
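To illustrate the idea of merging identical sequences and subfragments, here is a toy sketch in Python. It is not the procedure UniRef itself uses; it simply treats a sequence as redundant if it is an exact substring of a longer (or identical) sequence already kept. The accessions and sequences are hypothetical, and the quadratic comparison is only practical for small sets.

```python
def collapse_subfragments(entries):
    """Group identical sequences and exact subfragments.

    entries: dict mapping accession -> sequence.
    Returns a list of (representative_accession, member_accessions),
    where the representative is the longest sequence in its group.
    """
    # Longest first, so every potential parent is seen before its fragments.
    ordered = sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters = []  # each item: [rep_accession, rep_sequence, members]
    for acc, seq in ordered:
        for cluster in clusters:
            if seq in cluster[1]:  # identical to, or contained in, the rep
                cluster[2].append(acc)
                break
        else:
            clusters.append([acc, seq, [acc]])
    return [(rep, members) for rep, _, members in clusters]


# Hypothetical entries: EX2 duplicates EX1, EX3 is a fragment of EX1,
# EX4 differs by one residue and therefore stays separate.
entries = {
    "EX1": "MKTAYIAKQRQISFVK",
    "EX2": "MKTAYIAKQRQISFVK",
    "EX3": "AYIAKQ",
    "EX4": "MKTAYLAK",
}

clusters = collapse_subfragments(entries)
for rep, members in clusters:
    print(rep, members)
```

As the EX4 case shows, exact-match collapsing removes duplicates and fragments but leaves isoforms and variants of the same gene as separate groups, so some manual or gene-based selection may still be needed.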
