What Is The Difference Between The nr And TrEMBL Databases?
8.4 years ago
Pappu ★ 2.0k

I used to think that both databases contain similar entries. It seems that I was not correct.

8.4 years ago
Hamish ★ 3.2k

UniProtKB/TrEMBL (often referred to by its pre-UniProt name, TrEMBL) is the unreviewed section of UniProtKB. Its entries are protein sequences harvested from the coding sequence (CDS) features of entries in the ENA EMBL-Bank database (part of the International Nucleotide Sequence Database Collaboration, along with GenBank and DDBJ). The name TrEMBL refers to this source, being an abbreviation of "translated EMBL-Bank". Due to the nature of the source, UniProtKB/TrEMBL is highly redundant and the quality of the annotation is very variable. As well as the original annotations carried over from EMBL-Bank, additional annotations are added by a series of automated annotation workflows. As the entries in UniProtKB/TrEMBL are manually reviewed by the UniProt curators, they graduate into UniProtKB/Swiss-Prot (the human-curated section of UniProtKB) and may be merged into existing entries which describe the same gene in the same species.

In contrast, NCBI's nr database is composed of sequences obtained from:

This corresponds to the 'protein' database available in NCBI Entrez: http://www.ncbi.nlm.nih.gov/protein. These sequences are then processed to produce a non-identical database of sequences (often referred to as pseudo non-redundant, hence the name 'nr'). Thus in the 'nr' database each sequence occurs once, but may have multiple corresponding entries in the source databases.
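
To make the "non-identical" idea concrete, here is a minimal sketch of collapsing exact duplicate sequences while keeping track of every source entry. It is only an illustration under stated assumptions: the function name `collapse_identical`, the identifiers, and the toy sequences are all made up, and this is not how NCBI actually builds 'nr'.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (source_id, sequence) pairs by exact sequence string.

    Returns a dict mapping each distinct sequence to the list of
    source identifiers that carry it, in input order.
    """
    groups = defaultdict(list)
    for source_id, sequence in records:
        # Case-insensitive exact match; no similarity clustering is done.
        groups[sequence.upper()].append(source_id)
    return dict(groups)


# Hypothetical toy records, not real database entries.
records = [
    ("srcA|0001", "MKTAYIAKQR"),
    ("srcB|0002", "MKTAYIAKQR"),  # identical to the first sequence
    ("srcC|0003", "MKTAYIAKQL"),  # differs by one residue, so kept separately
]

nr_like = collapse_identical(records)
for sequence, sources in nr_like.items():
    print(sequence, sources)
```

Note that sequences differing by a single residue remain separate entries, which is why such a set is only "pseudo" non-redundant.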

As such 'nr' has greater coverage of the protein sequence space than UniProtKB/TrEMBL, but can be more complex to relate back to the original data sources, especially when looking for annotation, since a single sequence in 'nr' may correspond to proteins from many species described in different databases.

Of the UniProt databases the NCBI's 'nr' database is most similar to the UniProt Archive (UniParc) database, which is also non-identical and includes the same protein sequence sources as 'nr' but adds some additional protein sequence databases (see http://www.uniprot.org/help/uniparc).

For more coverage of protein sequence space, consider looking at SIMAP, which has sequences from many additional databases and thus contains some additional sequences, and has computed alignments for all of these.

Thanks Hamish. Any chance of anyone (e.g. anywhere on this board) using CD-HIT to intersect and diff nr vs UniProt? Heavy lifting (40 x 40 million) but doable in principle (or does SIMAP already have the stats?). Good Master's project? (Once upon a time IPI sort of did this at the species level but, alas, RIP.)


Currently NCBI's 'nr' contains 31,281,978 unique protein sequences; these are all contained within UniParc's total of 46,108,527 protein sequences (of which 44,199,360 are "active" and currently appear in other databases). If instead you meant UniProtKB, then 'nr' contains most of the sequences in UniProtKB (CDS translations from the GenBank environmental and WGS sections are excluded from 'nr' but can appear in UniProtKB). You can use the fact that UniParc contains all these sequences, and that the NCBI gi numbers included in UniParc are derived from 'nr', to figure out the overlaps using queries on the UniProt website. Note that due to the different release cycles for 'nr' and the UniProt databases, there will always be a small difference due to pending inclusions of new sequences.
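
If you have local copies of two sequence sets rather than using the website, the exact-match overlap can be sketched with plain set operations on sequence digests. This is only an illustration under stated assumptions: the helper names `sequence_keys` and `overlap_report` and the toy sequences are hypothetical, MD5 is used simply as a convenient fixed-size key, and at the scale of tens of millions of sequences you would want something more memory-conscious.

```python
import hashlib


def sequence_keys(sequences):
    """Return the set of MD5 digests of the upper-cased sequences."""
    return {hashlib.md5(s.upper().encode("ascii")).hexdigest() for s in sequences}


def overlap_report(set_a, set_b):
    """Count exact-match sequences shared by, or unique to, each input."""
    keys_a = sequence_keys(set_a)
    keys_b = sequence_keys(set_b)
    return {
        "shared": len(keys_a & keys_b),
        "only_a": len(keys_a - keys_b),
        "only_b": len(keys_b - keys_a),
    }


# Hypothetical toy data standing in for two database dumps.
db_a = ["MKTAYIAKQR", "MSTNPKPQRK", "MAHHHHHHVD"]
db_b = ["MKTAYIAKQR", "MSTNPKPQRK", "MGSSHHHHHH", "MLLAVLYCLV"]

report = overlap_report(db_a, db_b)
print(report)
```

This only detects identical sequences; near-identical ones (the kind of thing a clustering tool addresses) would need an alignment-based approach.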

SIMAP contains additional protein sequences (although it is unclear to me from where these actually come), and since they maintain a complete all vs. all set of alignments for these sequences, producing a sequence clustering from these should be relatively simple. If you are interested in doing that I suggest you contact them.

FWIW the IPI sequence data is included in UniParc and UniProt provides equivalent datasets for whole proteomes which integrate information from the same sources as IPI did (see http://www.uniprot.org/news/2011/06/28/release).

8.4 years ago
Daniel ★ 3.8k

Swiss-Prot/TrEMBL is "a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases".

The non-redundant database should theoretically contain all sequences which have been submitted to the major databases, with the redundant ones (replicated sequences, sub-sequences, etc.) removed.

Over in the Forum, user csmpresent has linked to his site which has a lot of information on: http://bioinformaticssoftwareandtools.co.in/

8.4 years ago
cdsouthan ★ 1.9k

There is more to it than this (I might write a blog post one day), and some of it is not even clear at the moment. The position for Swiss-Prot, with merging entries into a canonical entry and all differences recorded as feature lines (including merging PDBs that are separate in nr), is clearly documented. RefSeq does not technically merge (unless there is 100% overlap); it just chooses one (so that goes in twice). However, as TrEMBL is 40x bigger and the ratio continues to go up, the Swiss-Prot set becomes almost irrelevant in terms of the whole protein set on either side. Nominally this should make UniProt more similar to nr than it used to be. However, I have never seen any documentation on what is happening to genome-derived proteins on both sides, as TrEMBL used to be cDNA only but is not any more. For example, sometimes the XP is the same as a TrEMBL entry, sometimes it is different, and sometimes the XPs disappear.

At the end of the day, if you really want to capture all the data on one protein, you'll have to search both, and Ensembl just to be complete (and if you really mean all, don't forget the patent divisions).
