Retrieve Trembl/Uniprot previous version
7.1 years ago
leon • 0

Hi,

I'm trying to retrieve the human UniProt/TrEMBL release 2014_09 as a FASTA file, which is supposed to contain about 86,000 sequences.

Unfortunately, when I look through the previous releases on the UniProt FTP server, I only find the extraordinarily large .dat file containing all sequences of all species, so I can't even parse out the human entries without running into a memory error.

When I use the "date of" filter of the UniProt web interface, I find more than 86,000 unreviewed sequences, and I am not sure which starting date I am supposed to choose. However, choosing 01.01.2002 to 01.09.2014 already results in more than 100,000 unreviewed and only 10,000 reviewed sequences.

Is there an easy way to access this release?

Thanks for your help, Leon

sequence proteome uniprot trembl • 1.9k views
7.1 years ago

UniProtKB/TrEMBL release 2014_09 contained 118322 human entries, and UniProtKB/Swiss-Prot, for the same release, contained 20195 human entries:

ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_09/knowledgebase/UniProtKB_SwissProt-relstat.html

ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_09/knowledgebase/UniProtKB_TrEMBL-relstat.html

If you want to stick with the approach you tried, filtering by creation date on the UniProt website, the query would be:

http://www.uniprot.org/uniprot/?query=created%3A[19860101+TO+20141001]+AND+organism%3A%22Homo+sapiens+%28Human%29+[9606]%22&sort=score

which returns 20,024 Swiss-Prot and 116,256 TrEMBL entries.
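For reference, here is a minimal sketch of how a query URL like the one above could be assembled in Python. The function name `build_query_url` and its parameters are made up for illustration; the endpoint is simply the one quoted in this answer and may have changed since. Note that `quote_plus` also percent-encodes the square brackets, which is equivalent to the unencoded form shown above.

```python
from urllib.parse import quote_plus

# Endpoint copied from the URL quoted in the answer; assumed, not verified.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(start: str, end: str,
                    organism: str = "Homo sapiens (Human) [9606]") -> str:
    """Build a creation-date query URL (hypothetical helper).

    start and end are dates in YYYYMMDD form, e.g. "19860101".
    """
    query = f'created:[{start} TO {end}] AND organism:"{organism}"'
    # quote_plus turns spaces into '+' and escapes ':', '"', '(', ')', '[', ']'.
    return f"{BASE_URL}?query={quote_plus(query)}&sort=score"


url = build_query_url("19860101", "20141001")
```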

The reason why these numbers do not correspond to the numbers from the release notes is that there have been deletions, merges and demerges, and entries that were present in 2014_09 may no longer exist now.

The approach I suggest is to download the full UniProt Knowledgebase for that release (ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_09/knowledgebase/) in .dat format and extract only those entries that contain NCBI_TaxID=9606.

Then convert the result into FASTA format, as described in "A: Converting Uniprot File to a Fasta File in Perl".
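Since the memory error in the question most likely comes from loading the whole file at once, the two steps can also be done in a single streaming pass. The following is only a rough Python sketch, not the Perl solution linked above. It assumes the usual flat-file layout (two-letter line codes, `OX` line carrying the taxon ID, sequence lines after `SQ`, entries terminated by `//`). The function names and the simplified `>accession|entry_name` header are my own choices and not the official UniProt FASTA header.

```python
import gzip
import re
from typing import Iterable, Iterator, List, Optional, TextIO

TAXID_RE = re.compile(r"NCBI_TaxID=(\d+)")


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry at a time as a list of lines; '//' ends an entry."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def entry_to_fasta(entry: List[str], taxid: str = "9606",
                   width: int = 60) -> Optional[str]:
    """Return a FASTA record if the entry's OX line matches taxid, else None."""
    entry_name = accession = None
    matches = False
    in_sequence = False
    seq_parts: List[str] = []
    for line in entry:
        code = line[:2]
        if code == "ID" and entry_name is None:
            entry_name = line.split()[1]
        elif code == "AC" and accession is None:
            accession = line[5:].split(";")[0].strip()  # primary accession
        elif code == "OX":
            # Only OX is checked: OH (virus host) lines also carry NCBI_TaxID.
            found = TAXID_RE.search(line)
            matches = bool(found and found.group(1) == taxid)
        elif code == "SQ":
            in_sequence = True
        elif in_sequence and line.startswith("  "):
            seq_parts.append(line.replace(" ", ""))
    if not matches:
        return None
    seq = "".join(seq_parts)
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{accession}|{entry_name}\n{body}\n"


def extract_fasta(dat_path: str, out: TextIO, taxid: str = "9606") -> int:
    """Stream dat_path (plain or .gz) and write matching entries to out."""
    opener = gzip.open if dat_path.endswith(".gz") else open
    written = 0
    with opener(dat_path, "rt") as handle:
        for entry in iter_entries(handle):
            record = entry_to_fasta(entry, taxid)
            if record is not None:
                out.write(record)
                written += 1
    return written
```

You would call it roughly like this, where the file names are placeholders for whatever you unpacked from the release directory: `with open("human_2014_09.fasta", "w") as out: extract_fasta("uniprot_trembl.dat.gz", out)`. Only one entry is held in memory at a time, so the size of the full file should not matter.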
