20 months ago
Chvatil ▴ 90

Hello everyone, I'm looking for a bash script to download all the protein FASTA sequences from the bacterial and protist proteomes in UniProt. Does anyone know how I can do this, please?

Hello, I downloaded the file https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_bacteria.dat.gz and converted the .dat file to .fasta using the Python module Bio.SwissProt, but I only get 335,066 bacterial FASTA sequences, even though searching for taxonomy:bacteria on the UniProt website returns up to 151,792,141 bacterial sequences. Do you know why?

You have a better solution provided by Elisabeth Gasteiger below.

You can use seqret from EMBOSS to convert the dat files to fasta. I am not sure why you get a smaller number of entries. Perhaps redundant sequences are represented only once.
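If neither EMBOSS nor Biopython is at hand, the conversion can also be sketched with the Python standard library alone. This is a minimal sketch, not the seqret or Bio.SwissProt implementation. It assumes the usual flat-file layout (an `ID` line, one or more `AC` lines, an `SQ` line followed by indented sequence lines, and `//` closing each entry). The function names and the simplified `>accession entry_name` header are made up for illustration.

```python
from typing import Iterator, TextIO, Tuple


def iter_flatfile_records(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, entry_name, sequence) for each flat-file entry."""
    entry_name = None
    accession = None
    seq_parts = []
    in_sequence = False
    for line in handle:
        if line.startswith("ID   "):
            # Second whitespace-separated token is the entry name.
            entry_name = line.split()[1]
        elif line.startswith("AC   ") and accession is None:
            # Keep only the primary (first) accession.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("SQ   "):
            in_sequence = True
        elif line.startswith("//"):
            if entry_name is not None:
                yield accession, entry_name, "".join(seq_parts)
            entry_name = None
            accession = None
            seq_parts = []
            in_sequence = False
        elif in_sequence:
            # Sequence lines hold residues in space-separated blocks.
            seq_parts.append("".join(line.split()))


def flatfile_to_fasta(in_handle: TextIO, out_handle: TextIO, width: int = 60) -> int:
    """Write every entry as FASTA and return the number of entries written."""
    count = 0
    for accession, entry_name, seq in iter_flatfile_records(in_handle):
        # Simplified header (an assumption, not UniProt's official FASTA header).
        out_handle.write(f">{accession} {entry_name}\n")
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")
        count += 1
    return count
```

For a compressed download, open the input with `gzip.open(path, "rt")` and pass the handle to `flatfile_to_fasta`. The returned count can then be compared with the number of entries expected for that file.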

OK, I see. In fact I only downloaded the Swiss-Prot part and not the TrEMBL part. I will check whether the number of entries is right once TrEMBL is included.
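To make the Swiss-Prot/TrEMBL split concrete, here is a small sketch that builds the two download URLs for one taxonomic division. It assumes the TrEMBL file sits in the same FTP directory and follows the same naming pattern as the Swiss-Prot file linked above. `division_urls` is a hypothetical helper, and it only builds strings, without checking that the files exist.

```python
# Directory taken from the Swiss-Prot URL quoted in this thread.
BASE_URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/taxonomic_divisions/"
)


def division_urls(division: str) -> dict:
    """Return the assumed Swiss-Prot and TrEMBL flat-file URLs for a division."""
    return {
        # Reviewed entries (the file already downloaded above).
        "sprot": f"{BASE_URL}uniprot_sprot_{division}.dat.gz",
        # Unreviewed entries; file name assumed to follow the same pattern.
        "trembl": f"{BASE_URL}uniprot_trembl_{division}.dat.gz",
    }


for section, url in division_urls("bacteria").items():
    print(section, url)
```

The TrEMBL file is much larger than the Swiss-Prot one, so it is probably better to process it as a stream than to load it into memory.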

20 months ago

This help page on the UniProt website, https://www.uniprot.org/help/api_downloading, includes a code example to "Download the UniProt reference proteomes for all organisms below a given taxonomy node in compressed FASTA format".

Great, I'll try that one, thanks.

Hi, I used this technique, but in the end I only found 1,335,574 FASTA sequences instead of 151,792,141. Any idea why?

I used the following command: perl perl_test.pl 2 (where perl_test.pl is the code from the UniProt web page).
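When comparing totals like the ones above, it helps to count the records in each downloaded file in the same way. A minimal sketch, assuming each FASTA record starts with a `>` header line and that compressed files end in `.gz`. `count_fasta_records` is a hypothetical helper name.

```python
import gzip


def count_fasta_records(path: str) -> int:
    """Count header lines (starting with '>') in a plain or gzipped FASTA file."""
    # Pick the opener from the file extension (assumption: .gz means gzip).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Summing this count over all the files the script produced gives the number to compare with the website search result. Note that the two may legitimately differ, because the script fetches reference proteomes while the search counts all matching entries.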