Question: Help Regarding Redundant Entries Of Uniprotkb/Trembl
10.4 years ago by
Ananth30
Ananth30 wrote:

I am using UniProtKB to download protein sequences of the Argonaute superfamily (Query = Argonaute OR Piwi). The hits contain 194 UniProtKB/Swiss-Prot and 888 UniProtKB/TrEMBL entries.

On further analysis of these hits, I find that the UniProtKB/TrEMBL entries are redundant, whereas UniProtKB/Swiss-Prot gives one record per gene per species.

I am in a dilemma as to which UniProtKB/TrEMBL entries to consider for a particular protein from a species, since there are multiple entries per gene for the same species, each with a different accession number.

For example, the protein Seawi from Strongylocentrotus purpuratus has only one gene, but UniProtKB/TrEMBL lists 4 accessions (Q9GPA7, Q9GPA8, Q9GPA6, C9EID6) with varying sequence lengths.

I will miss a large number of sequences if I use only the UniProtKB/Swiss-Prot entries.

Kindly help me on this...

10.4 years ago by
Neilfws49k
Sydney, Australia
Neilfws49k wrote:

The issue here is that you are dealing with two separate databases, each of which is designed for a different purpose.

UniProtKB/TrEMBL was designed to deal with high-throughput data (e.g. from genome sequencing), by applying automated analyses. Nucleotide sequences from EMBL-Bank/GenBank/DDBJ, annotated as coding, are translated and annotated "automatically", using a computational pipeline. These protein sequences may therefore contain errors and are frequently not full-length. TrEMBL is non-redundant in the sense that identical, full-length sequences from the same organism are represented by a single record, but there may be many records for fragments, isoforms etc., derived from the same protein.

Entries in UniProtKB/Swiss-Prot are curated and reviewed manually. They are non-redundant in the sense that each record represents one "gene". Fragments, isoforms etc. can then be derived from the feature table.

How you use the data depends on precisely what you want to do. You might think of TrEMBL data as less "reliable", so you're not necessarily "missing out" by not using it.
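If you download the hits as FASTA, the two sections can be told apart from the header alone: UniProtKB FASTA headers start with `sp|` for Swiss-Prot and `tr|` for TrEMBL. A minimal sketch of splitting a list of headers that way follows; the function name and the toy headers in the usage note are made up for illustration.

```python
from typing import Dict, Iterable, List


def split_by_section(headers: Iterable[str]) -> Dict[str, List[str]]:
    """Group UniProtKB FASTA headers by database section.

    'sp' = UniProtKB/Swiss-Prot (reviewed), 'tr' = UniProtKB/TrEMBL (unreviewed).
    Headers with any other prefix are ignored.
    """
    sections: Dict[str, List[str]] = {"sp": [], "tr": []}
    for header in headers:
        # The database code is the text before the first '|', after an optional '>'.
        db = header.lstrip(">").split("|", 1)[0]
        if db in sections:
            sections[db].append(header)
    return sections
```

Calling `split_by_section([">sp|TOY001|TOY1_SPONE ...", ">tr|TOY002|TOY2_SPONE ..."])` would put the first header under `"sp"` and the second under `"tr"`, so the curated and automatic entries can be handled separately.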

Some useful links:


Thank you very much for the reply. It was of great help.

written 10.4 years ago by Ananth

Glad to hear it. Feel free to vote for the answer then :-)

written 10.4 years ago by Neilfws
10.4 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

The problems that you mention are the very reason why we do not use UniProt as the source of sequences for STRING and related databases: UniProtKB/Swiss-Prot does not contain sequences for all genes, and UniProtKB/TrEMBL often contains multiple entries for a single gene, with no easy way to construct a unique set.

What I do instead is rely on genome-centric databases (such as Ensembl and RefSeq genomes), in which it is explicit which proteins are encoded by the same locus. All you then have to do is decide which of the splice isoforms you want to use; one option is to use the longest isoform, so that a single isoform covers as much of the coding potential of each gene as possible.
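The longest-isoform rule can be sketched in a few lines of Python. This is only an illustration: the `Isoform` record and its field names are assumptions, and the gene-to-protein mapping would in practice come from the genome database's own annotation.

```python
from typing import Dict, Iterable, NamedTuple


class Isoform(NamedTuple):
    gene_id: str      # locus identifier (hypothetical field name)
    protein_id: str   # protein/isoform identifier (hypothetical field name)
    sequence: str     # amino-acid sequence


def longest_isoform_per_gene(isoforms: Iterable[Isoform]) -> Dict[str, Isoform]:
    """Keep one isoform per gene: the one with the longest sequence.

    On ties, the isoform seen first is kept.
    """
    best: Dict[str, Isoform] = {}
    for iso in isoforms:
        current = best.get(iso.gene_id)
        if current is None or len(iso.sequence) > len(current.sequence):
            best[iso.gene_id] = iso
    return best
```

Note that the longest isoform is a convenience choice, not necessarily the biologically dominant one.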

If you are specifically interested in the sea urchin, I would recommend that you base your analysis on the data from the Sea Urchin Genome Project.


Thank you very much for the reply. It was of great help.

written 10.4 years ago by Ananth
10.4 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

If you expect that there's only one copy per species (or if it doesn't matter that you miss a second copy), just take the longest protein from each species. As Lars pointed out, even if you use a complete genome, you will still have multiple splice forms and need to decide which of them to use.
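Working directly from a UniProtKB FASTA download, this could look roughly like the sketch below. It assumes the standard UniProtKB header layout, in which the organism name follows `OS=` and runs up to the next `OX=`, `GN=`, `PE=` or `SV=` field; the function names are made up for illustration.

```python
import re
from typing import Dict, Iterator, Tuple

# Organism name: everything after "OS=" up to the next known field or end of line.
OS_PATTERN = re.compile(r"OS=(.+?)(?= OX=| GN=| PE=| SV=|$)")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_species(fasta_text: str) -> Dict[str, Tuple[str, str]]:
    """Map each organism name to the (header, sequence) of its longest entry."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in read_fasta(fasta_text):
        match = OS_PATTERN.search(header)
        if match is None:
            continue  # no organism field; skip rather than guess
        species = match.group(1)
        if species not in best or len(seq) > len(best[species][1]):
            best[species] = (header, seq)
    return best
```

This deliberately keeps exactly one entry per species, so a genuine second copy (paralog) in the same species would be dropped, as noted above.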


Thank you very much for the reply. It was of great help.

written 10.4 years ago by Ananth
9.6 years ago by
Fidel2.0k
Germany
Fidel2.0k wrote:

A section of the UniProtKB FAQ states:

"UniProtKB/TrEMBL is 'non-redundant' in the sense that all identical, full-length protein sequences, provided they come from the same species, are represented in a single record. UniProtKB/TrEMBL sequences are translations of CDS submitted to the EMBL-Bank/GenBank/DDBJ databases and cross-references to the original submissions are kept in the entries. Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

But they also say:

"Identical protein sequences which are: Fragments, isoforms, variants and so on, encoded by the same gene, are stored in separate entries."

One way to avoid this redundancy is to use UniRef100, which combines identical sequences and subfragments into a single entry and has mappings to UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
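To make the idea concrete, here is a toy sketch of that kind of collapsing: identical sequences and exact subfragments are grouped under the longest sequence that contains them. It is only an illustration of the principle and not UniRef's actual procedure; the function name and the greedy substring approach are my own assumptions.

```python
from typing import Dict, List


def collapse_identical_and_subfragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group accessions whose sequence equals, or is contained in, a longer one.

    Returns a mapping from representative accession to the members of its
    cluster (the representative itself is listed first).
    """
    # Longest first, so every potential parent is seen before its fragments;
    # accession is the tie-breaker to keep the result deterministic.
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters: Dict[str, List[str]] = {}
    for accession, sequence in ordered:
        for representative in clusters:
            if sequence in seqs[representative]:
                clusters[representative].append(accession)
                break
        else:
            # Not contained in any existing representative: start a new cluster.
            clusters[accession] = [accession]
    return clusters
```

This only removes exact duplicates and exact fragments. Isoforms or variants that differ by even one residue stay in separate clusters, so a per-gene choice may still be needed afterwards.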
