I have been trying to use Biopython to parse out certain domains from proteins, and it was suggested that I use the Bio.SwissProt module. Unfortunately, I don't see any Swiss-Prot data files available on UniProt. The only available file formats are GFF, FASTA, XML, and TXT. Anyone know how I can get access to the Swiss-Prot file format?
The "text" files (also known as "dat" files) are the files in UniProtKB/Swiss-Prot format, so you can fetch these with:
or using one of the many mirrors:
Note: the UniProtKB/TrEMBL file is large (approx. 20 GB compressed and about 110 GB uncompressed), so you will likely want to download it only if you really need it. See "Why is UniProtKB composed of 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL?" for an overview of the differences between UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
If you need the whole database, fetches like the above are recommended.
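Once a "dat" file is on disk, a rough sketch of reading it might look like the following. It uses only the standard library and relies on the flat-file convention that each entry ends with a `//` line and that each line starts with a two-letter code (ID, AC, FT, ...). The function names, the sample entry, and the assumption that DOMAIN features use the `start..end` location style are mine, purely for illustration; Biopython's `Bio.SwissProt.parse(handle)` reads the same files and is the more complete option.

```python
import gzip
import io

# Made-up entry for illustration only; not a real UniProtKB record.
SAMPLE = """\
ID   TEST_HUMAN              Reviewed;         100 AA.
AC   P00001; Q00001;
FT   DOMAIN          24..80
FT                   /note="Example domain"
//
"""

def iter_entries(handle):
    """Yield each entry as a list of lines; a '//' line closes an entry."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)

def summarize(entry):
    """Return (entry name, primary accession, DOMAIN locations) for one entry."""
    name = accession = None
    domains = []
    for line in entry:
        code, rest = line[:2], line[5:]
        if code == "ID" and name is None:
            name = rest.split()[0]
        elif code == "AC" and accession is None:
            accession = rest.split(";")[0].strip()
        elif code == "FT" and rest.startswith("DOMAIN"):
            # Assumes the 'FT   DOMAIN   start..end' layout.
            domains.append(rest.split()[1])
    return name, accession, domains

def read_dat_gz(path):
    """Stream summaries from a gzip-compressed 'dat' file without unpacking it."""
    with gzip.open(path, "rt") as handle:
        for entry in iter_entries(handle):
            yield summarize(entry)

if __name__ == "__main__":
    for entry in iter_entries(io.StringIO(SAMPLE)):
        print(summarize(entry))
```

Streaming entry by entry keeps memory use flat, which matters for a file the size of TrEMBL.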
UniProt also provides subsets of the database based on:
- Taxonomic classification: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/
- Proteomes (replacement for IPI): ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/
- Complete and reference proteomes: http://www.uniprot.org/taxonomy/complete-proteomes
These subsets may be more appropriate if you are only interested in certain organisms.
For specific entries, where you already have a list of identifiers or accessions, the various web services providing access to the UniProtKB data are more appropriate. For example:
- UniProt.org: http://www.uniprot.org/faq/28
- EMBL-EBI dbfetch: http://www.ebi.ac.uk/Tools/dbfetch/
- EMBL-EBI WSDbfetch: http://www.ebi.ac.uk/Tools/webservices/services/dbfetch
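For a short list of accessions, a request to dbfetch could be sketched as below. Treat the endpoint path and the `db`, `id`, `format` and `style` parameter values as assumptions on my part; check them against the dbfetch documentation linked above before relying on them. The accessions in the usage comment are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the dbfetch documentation.
DBFETCH = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"

def dbfetch_url(accessions, db="uniprotkb", fmt="default"):
    """Build a dbfetch URL requesting the given accessions as raw text."""
    query = urlencode({
        "db": db,
        "id": ",".join(accessions),  # comma-separated list of identifiers
        "format": fmt,               # assumed to return the flat-file format
        "style": "raw",              # plain text instead of an HTML page
    })
    return f"{DBFETCH}?{query}"

def fetch_entries(accessions):
    """Download the entries and return them as one string (needs network access)."""
    with urlopen(dbfetch_url(accessions)) as response:
        return response.read().decode("utf-8")

# Usage (placeholder accessions):
#     text = fetch_entries(["P12345", "Q9Y6K9"])
```

The returned text is in the same flat-file layout as the "dat" files, so it can be handed to the same parser.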