Question

Downloading the whole UniProt dataset

0

6.5 years ago

srdjanmasirevic2 ▴ 10

I need to download the whole UniProt dataset (txt files). Does anybody know whether it is possible to download the whole dataset in txt format, but without all of the information about each protein? I only need the ID, name, function, sequence, structure and resolution. For a single UniProt ID it would look something like this. Thanks in advance.

ID   AQP1_HUMAN              Reviewed;         269 AA.
AC   P29972; B5BU39; E7EM69; E9PC21; F5GY19; Q8TBI5; Q8TDC1;
DE   RecName: Full=Aquaporin-1;
DE            Short=AQP-1;
DE   AltName: Full=Aquaporin-CHIP;
DE   AltName: Full=Urine water channel;
DE   AltName: Full=Water channel protein for red blood cells and kidney proximal tubule;
DR   PDB; 1FQY; X-ray; 3.80 A; A=1-269.
DR   PDB; 1H6I; X-ray; 3.54 A; A=1-269.
DR   PDB; 1IH5; X-ray; 3.70 A; A=1-269.
DR   PDB; 4CSK; X-ray; 3.28 A; A=1-269.
SQ   SEQUENCE   269 AA;  28526 MW;  BA204D82FB26352E CRC64;
     MASEFKKKLF WRAVVAEFLA TTLFVFISIG SALGFKYPVG NNQTAVQDNV KVSLAFGLSI
     ATLAQSVGHI SGAHLNPAVT LGLLLSCQIS IFRALMYIIA QCVGAIVATA ILSGITSSLT
     GNSLGRNDLA DGVNSGQGLG IEIIGTLQLV LCVLATTDRR RRDLGGSAPL AIGLSVALGH
     LLAIDYTGCG INPARSFGSA VITHNFSNHW IFWVGPFIGG ALAVLIYDFI LAPRSSDLTD
     RVKVWTSGQV EEYDLDADDI NSRVEMKPK

Assembly • 2.6k views

ADD COMMENT • link updated 6.4 years ago by Elisabeth Gasteiger ★ 2.4k • written 6.5 years ago by srdjanmasirevic2 ▴ 10

score 2 · Answer 1 · 2017-11-07

2

6.5 years ago

GenoMax 141k

Take a look at the README file at this UniProt FTP site. Then choose the file you need.

ADD COMMENT • link 6.5 years ago by GenoMax 141k

0

oh I see..Thanks man!

ADD REPLY • link 6.5 years ago by srdjanmasirevic2 ▴ 10

score 2 · Answer 2 · 2017-11-21

You could download the complete flat file (either from the UniProt FTP site, or from your query result page on the website) and then use grep or some scripting language to keep only these line types. If you need assistance with regular expressions to obtain exactly this data, please don't hesitate to contact the UniProt helpdesk. The flat file format is documented at http://www.uniprot.org/docs/userman.htm
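As a rough illustration of the "scripting language" route, here is a minimal Python sketch that keeps only the line types shown in the question (ID, AC, DE, PDB cross-references, SQ and the sequence lines). The function name `filter_flat_file` is made up for this example, and the sketch assumes the usual flat file layout: a two-character line code followed by three blanks, sequence lines starting with blanks, and `//` ending each entry.

```python
# Line codes kept as-is, matching the example entry in the question.
KEEP_CODES = {"ID", "AC", "DE", "SQ"}


def filter_flat_file(lines):
    """Yield only the ID, AC, DE, PDB cross-reference, SQ and sequence lines.

    `lines` is any iterable of text lines, e.g. an open file handle.
    (Hypothetical helper, not part of any UniProt tool.)
    """
    for line in lines:
        line = line.rstrip("\n")
        code = line[:2]
        if code in KEEP_CODES:
            yield line
        elif code == "DR" and line[5:].startswith("PDB;"):
            # Keep PDB cross-references only (not PDBsum or other databases).
            yield line
        elif code == "  " and line.strip():
            # Sequence data lines have blanks where the line code would be.
            yield line
        elif code == "//":
            # Entry terminator, kept so that entries stay separated.
            yield line
```

To run it over a downloaded file, something like `for line in filter_flat_file(gzip.open(path, "rt")): print(line)` should do, where `path` points to your local copy of the flat file. To keep the function annotation as well, you would additionally need the `CC` lines of the FUNCTION comment block, which this sketch does not handle.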

UniProt also offers a tab-delimited download from the website (http://www.uniprot.org/help/customize, http://insideuniprot.blogspot.ch/2015_03_01_archive.html).

This would work perfectly in your case, keeping columns for identifiers, protein names, PDB cross-references and sequence. However, we unfortunately have a limitation for the tab-separated cross-reference download: while the HTML version of the result table contains the full cross-reference information, including PDB method and resolution, the tab-separated download only contains the identifier and excludes the other information on these lines.

We are looking into changing this, although there are some issues (separators, line length as there are entries with more than 500 PDB cross-references, etc).
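Until then, the method and resolution can be pulled from the flat file's `DR   PDB;` lines. The following is only a sketch: `PdbXref` and `parse_pdb_xref` are hypothetical names, and the field order (identifier, method, resolution, chains) is taken from the example lines in the question. It also assumes that entries without a resolution carry a non-numeric placeholder in that field, in which case `None` is returned.

```python
from typing import NamedTuple, Optional


class PdbXref(NamedTuple):
    pdb_id: str
    method: str
    resolution: Optional[float]  # in angstroms; None when not numeric
    chains: str


def parse_pdb_xref(line):
    """Parse one 'DR   PDB; ...' flat file line; return None for other lines."""
    if not line.startswith("DR   PDB;"):
        return None
    # Drop the line code, the trailing period, then split on ';'.
    fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
    if len(fields) < 5:
        return None
    _, pdb_id, method, resolution, chains = fields[:5]
    try:
        # "3.80 A" -> 3.8
        value = float(resolution.split()[0])
    except (IndexError, ValueError):
        value = None
    return PdbXref(pdb_id, method, value, chains)
```

Calling `parse_pdb_xref` on each line of an entry and discarding the `None` results gives one record per PDB structure, which can then be written out as extra tab-separated columns next to the identifier.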