Question

What Is The Difference Between The nr And TrEMBL Databases?

3

10.7 years ago

Pappu ★ 2.1k

I used to think that both databases contain similar entries. It seems that I was not correct.

database • 18k views

updated 10.7 years ago by Hamish ★ 3.2k • written 10.7 years ago by Pappu ★ 2.1k

score 10 · Answer 1 · 2013-07-24

UniProtKB/TrEMBL (often referred to by its pre-UniProt name, TrEMBL) is the unreviewed section of UniProtKB. Its entries are protein sequences harvested from the coding sequence (CDS) features of entries in the ENA EMBL-Bank database (part of the International Nucleotide Sequence Database Collaboration, along with GenBank and DDBJ). The name TrEMBL refers to this source, being an abbreviation of "translated EMBL-Bank". Due to the nature of the source, UniProtKB/TrEMBL is highly redundant and the quality of the annotation is very variable. As well as the original annotations carried over from EMBL-Bank, additional annotations are added by a series of automated annotation workflows. As the entries in UniProtKB/TrEMBL are manually reviewed by the UniProt curators, they graduate into UniProtKB/Swiss-Prot (the human-curated section of UniProtKB) and may be merged into existing entries which describe the same gene in the same species.
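As a small illustration of the reviewed/unreviewed split: in UniProtKB FASTA downloads the header line starts with `sp|` for Swiss-Prot entries and `tr|` for TrEMBL entries. The sketch below only inspects that prefix. The function name is made up for this example, and the TrEMBL header in the usage lines is an invented placeholder, not a real entry.

```python
def uniprot_section(header: str) -> str:
    """Guess the UniProtKB section from the prefix of a FASTA header line."""
    # Drop the leading '>' if present, then take the text before the first '|'.
    prefix = header.lstrip(">").split("|", 1)[0]
    if prefix == "sp":
        return "Swiss-Prot (reviewed)"
    if prefix == "tr":
        return "TrEMBL (unreviewed)"
    return "unknown"


if __name__ == "__main__":
    # First header follows the usual UniProtKB layout; second is a made-up placeholder.
    print(uniprot_section(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha"))
    print(uniprot_section(">tr|X0XXX0|X0XXX0_HUMAN Placeholder protein"))
```

This says nothing about annotation quality in itself; it is just a quick way to see which section a given record came from.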

In contrast, NCBI's nr database is composed of sequences obtained from:

Translations of GenBank coding sequences, often referred to as GenPept
UniProtKB/Swiss-Prot
PIR
PRF
PDB
NCBI RefSeq

This corresponds to the 'protein' database available in NCBI Entrez: http://www.ncbi.nlm.nih.gov/protein. These sequences are then processed to produce a non-identical (often referred to as pseudo non-redundant, hence the name 'nr') database of sequences. Thus, in the 'nr' database each sequence occurs once, but may have multiple source entries in the source databases.
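To make the "non-identical" idea concrete, here is a minimal sketch of collapsing exact duplicate sequences so that each sequence is stored once together with every source identifier that carried it. This is not NCBI's actual pipeline; the function name and the toy identifiers are invented for illustration.

```python
def collapse_identical(records):
    """Group (identifier, sequence) pairs so each distinct sequence appears once.

    Returns a dict mapping each sequence to the list of identifiers that share it.
    Only exact (case-insensitive) matches are merged; sub-sequences and
    near-identical sequences stay separate, which is why the result is only
    "pseudo" non-redundant.
    """
    merged = {}
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


if __name__ == "__main__":
    # Toy records with made-up identifiers from three hypothetical sources.
    toy = [
        ("sourceA:1", "MKTAYIAK"),
        ("sourceB:7", "MKTAYIAK"),   # identical to sourceA:1, so merged
        ("sourceC:3", "MKTAYIAKQ"),  # one residue longer, so kept separate
    ]
    for seq, ids in collapse_identical(toy).items():
        print(seq, ids)
```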

As such 'nr' has greater coverage of the protein sequence space than UniProtKB/TrEMBL, but can be more complex to relate back to the original data sources, especially when looking for annotation, since a single sequence in 'nr' may correspond to proteins from many species described in different databases.
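The difficulty of relating an 'nr' sequence back to its sources can be sketched as follows. This assumes the convention used in nr FASTA dumps where the deflines of all merged source entries are joined on one header line with a Ctrl-A (`\x01`) character, and where the organism, if given, appears in trailing square brackets. The function name and the example defline are invented; real headers vary and may need more careful parsing.

```python
import re


def split_nr_defline(defline: str):
    """Split a merged nr-style header into one dict per source entry."""
    entries = []
    for part in defline.lstrip(">").split("\x01"):
        # First whitespace-delimited token is treated as the identifier.
        identifier, _, description = part.strip().partition(" ")
        # Assumption: organism name, when present, is the final [bracketed] text.
        match = re.search(r"\[([^\[\]]+)\]\s*$", description)
        entries.append({
            "id": identifier,
            "description": description,
            "species": match.group(1) if match else None,
        })
    return entries


if __name__ == "__main__":
    # Made-up header: one sequence shared by two entries from different species.
    header = ">ID_A example protein [Species one]\x01ID_B example protein [Species two]"
    for entry in split_nr_defline(header):
        print(entry["id"], entry["species"])
```

One sequence can therefore yield several identifiers and several species, and any annotation lookup has to be repeated per source entry.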

Of the UniProt databases, the one most similar to NCBI's 'nr' database is the UniProt Archive (UniParc), which is also non-identical and includes the same protein sequence sources as 'nr' but adds some additional protein sequence databases (see http://www.uniprot.org/help/uniparc).

For more coverage of protein sequence space, consider looking at SIMAP, which has sequences from many additional databases, and thus contains some additional sequences, and has computed alignments for all of these.

score 1 · Answer 2 · 2013-07-23

Swiss-Prot/TrEMBL is "a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases".

The non-redundant database should theoretically contain all "non-redundant" sequences (i.e. with replicated sequences, sub-sequences, etc. removed) which have been submitted to the major databases.

Over in the Forum, user csmpresent has linked to his site which has a lot of information on: http://bioinformaticssoftwareandtools.co.in/

score 1 · Answer 3 · 2013-07-23

There is more to it than this (I might write a blog post one day), and some of it is not even clear at the moment. The position for Swiss-Prot, which merges entries into a canonical entry with all differences recorded as feature lines (including merging PDBs that are separate in nr), is clearly documented. RefSeq does not technically merge (unless there is 100% overlap); it just chooses one (so that goes in twice). However, as TrEMBL is 40x bigger and the ratio continues to go up, the Swiss-Prot set becomes almost irrelevant in terms of the whole protein set on either side. Nominally this should make UniProt more similar to nr than it used to be. However, I have never seen any documentation on what is happening to genome-derived proteins on both sides, as TrEMBL used to be cDNA only but is not any more. For example, sometimes the XP is the same as a TrEMBL entry, sometimes it is different, and sometimes the XPs disappear.

At the end of the day, if you really want to capture all the data on one protein, you'll have to search both, and Ensembl just to be complete (and if you really mean all, don't forget the patent divisions).