Different Swiss-Prot "versions"
4.0 years ago
Twitty ▴ 30

Hi everyone, I want to download several databases for subsequent use in a transcriptome annotation pipeline. One of them is Swiss-Prot (also called UniProtKB/Swiss-Prot). I understand that the main source is www.uniprot.org, and ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/ in particular. However, one can also find a Swiss-Prot distribution in the ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ directory.

Both have been updated recently, but they are not identical: they differ in everything from the number of sequences to the header format. For example, the UniProt file contains 561911 entries, while the NCBI file contains only 473509 (according to https://www.uniprot.org/statistics/Swiss-Prot, the original Swiss-Prot had that many entries about 10 years ago). On the NCBI website, all I could find is that it is the "Last major release of the UniProtKB/SWISS-PROT protein sequence database (no incremental updates)." (https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf)

So I'm curious: what exactly is the file deposited at NCBI? Does anyone know?

All the best to everyone.

protein databases Swiss-Prot • 1.2k views
4.0 years ago
Twitty ▴ 30

Thanks for the advice, JC. I hadn't thought they would answer such a small question, but to my surprise I got a reply, which I give below:

In short, this is due to the non-redundant nature of the BLAST database.

See explanation appended below for more technical details. Regards, NCBI User Services

The statistic from this file, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/relnotes.txt (UniProtKB/Swiss-Prot: 561,911 entries),

closely matches what is available from the NCBI Protein database: https://www.ncbi.nlm.nih.gov/protein?term=%22swissprot%22%5BFilter%5D (Items: 1 to 20 of 561499).

So what is in the BLAST database description is essentially correct.

The BLAST database itself, however, reports something different:

Title: Non-redundant UniProtKB/SwissProt sequences.
Molecule Type: Protein
Update date: 2020/04/09
Number of sequences: 473509

This is the count after identical sequences are collapsed into identical protein groups (IPG). Each collapsed group contains a control-A character in its defline and holds two or more sequences:

$ gunzip -c db/FASTA/swissprot.gz | grep ">" | grep -c $'\01'
38217
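For anyone who prefers a script to the one-liner above, here is a minimal Python sketch of the same count. It assumes the deflines start with ">" and that merged groups are marked with the control-A character (`\x01`), as described in the reply; the function names and the local file name in the usage note are my own and purely illustrative.

```python
import gzip


def count_merged(lines):
    """Return (total_deflines, deflines_containing_control_A)."""
    total = merged = 0
    for line in lines:
        if line.startswith(">"):
            total += 1
            # Control-A (\x01) separates the members of a collapsed group.
            if "\x01" in line:
                merged += 1
    return total, merged


def count_merged_in_gzip(path):
    """Same count, reading a gzip-compressed FASTA file from disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_merged(handle)
```

Usage would look like `count_merged_in_gzip("swissprot.gz")`, where `swissprot.gz` is a hypothetical local copy of the NCBI file. The first number should correspond to the "Number of sequences" reported by BLAST and the second to the count of collapsed groups.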

For example, swissprot has this entry:

$ curl 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz' | gunzip -c | grep "Q7V8U8"

sp|Q7V8U8|YIDD_PROMM Putative membrane protein insertion efficiency factor OS=Prochlorococcus marinus (strain MIT 9313) OX=74547 GN=PMT_0228 PE=3 SV=1

It is part of a non-redundant set in the BLAST database:

$ gunzip -c db/FASTA/swissprot.gz | grep ">" | grep Q7V8U8

A2CBJ0.1 RecName: Full=Putative membrane protein insertion efficiency factor [Prochlorococcus marinus str. MIT 9303]Q7V8U8.1 RecName: Full=Putative membrane protein insertion efficiency factor [Prochlorococcus marinus str. MIT 9313]
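The control-A separator is not printable, so the two members look glued together in the output above. A rough sketch of how such a merged defline could be split back into its members follows. It assumes the separator sits between "]" and "Q7V8U8.1" (the reply implies this but the pasted output cannot show it), and `split_defline` is a name I made up for illustration.

```python
def split_defline(defline, sep="\x01"):
    """Split a merged BLAST defline into (accession, description) pairs."""
    text = defline.rstrip("\n").lstrip(">")
    members = []
    for part in text.split(sep):
        part = part.strip()
        if not part:
            continue
        # The accession is the first whitespace-delimited token.
        accession, _, description = part.partition(" ")
        members.append((accession, description))
    return members


# Assumption: the control-A character sits where the two records meet.
example = (
    ">A2CBJ0.1 RecName: Full=Putative membrane protein insertion efficiency "
    "factor [Prochlorococcus marinus str. MIT 9303]"
    "\x01"
    "Q7V8U8.1 RecName: Full=Putative membrane protein insertion efficiency "
    "factor [Prochlorococcus marinus str. MIT 9313]"
)

for accession, description in split_defline(example):
    print(accession, "|", description)
```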

So two Swiss-Prot sequences are collapsed into a single entry in this case. Some groups have quite a few sequences collapsed into them, so the number of sequences absorbed by collapsing is much larger than the 38K groups, which makes up the difference.
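As a quick sanity check on that reasoning, the numbers quoted in this thread can be put together as below. This is only back-of-the-envelope arithmetic: it assumes the UniProt and NCBI counts describe the same release, which may not be exactly true (the NCBI Protein query returned 561499 rather than 561911).

```python
uniprot_entries = 561_911   # from the UniProt relnotes.txt quoted above
blast_entries = 473_509     # "Number of sequences" reported for the BLAST db
merged_groups = 38_217      # deflines containing control-A

# Sequences that disappeared because they were folded into another record.
absorbed = uniprot_entries - blast_entries

# Every merged group keeps one record and absorbs the remaining members,
# so the implied average group size is 1 + absorbed / merged_groups.
average_group_size = 1 + absorbed / merged_groups

print(f"absorbed sequences: {absorbed}")
print(f"implied average members per merged group: {average_group_size:.2f}")
```

An average a little above three members per merged group would be consistent with the explanation that some groups hold many more than two sequences.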

You may have heard that non-redundancy is a fairly flexible term, and this case is a good example. The UniProt help page on database redundancy (https://www.uniprot.org/help/redundancy) says the following about Swiss-Prot:

  • UniProtKB/Swiss-Prot is 'non-redundant' in the sense that all protein products encoded by one gene in a given species are represented in a single record. This includes alternative splicing isoforms, fragments, polymorphisms, sequence conflicts, etc. Differences between sequence reports are analyzed, fully documented and reported in the entry. Cross-references to the original submissions to EMBL-Bank/GenBank/DDBJ databases are kept (see for instance, Q9BXB7).

Inspecting the sequences of entries Q7V8U8 and A2CBJ0 shows that they are identical but belong to different strains of the same species. So in the NCBI Swiss-Prot database these two entries are collapsed into one, which seems more sensible to me.
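If you want to see how much collapsing identical sequences shrinks a FASTA file of your own, something like the following sketch could be used. It is not NCBI's actual procedure, just a plain grouping by exact sequence string; the function names and the toy records are invented for illustration and are not real Swiss-Prot sequences.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def group_identical(lines):
    """Map each distinct sequence to the list of record IDs that share it."""
    groups = defaultdict(list)
    for header, sequence in read_fasta(lines):
        tokens = header.split()
        record_id = tokens[0] if tokens else ""
        groups[sequence.upper()].append(record_id)
    return groups


# Toy input: entryA and entryB share a sequence, entryC does not.
toy = [">entryA strain 1\n", "MKVL\n", "AAG\n",
       ">entryB strain 2\n", "MKVLAAG\n",
       ">entryC\n", "MTTP\n"]
groups = group_identical(toy)
print(len(toy), "lines,", len(groups), "distinct sequences")
print([ids for ids in groups.values() if len(ids) > 1])
```

Run on the UniProt Swiss-Prot FASTA, `len(groups)` should land near the BLAST database size if exact sequence identity is the only collapsing criterion, though I have not verified that NCBI uses exactly this rule.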


NCBI support is always supportive and responsive.

4.0 years ago
JC 13k

You can ask the NCBI people directly here: https://www.ncbi.nlm.nih.gov/home/about/contact/
