Question: Downloading the entire UniProt dataset
23 months ago, srdjanmasirevic2 wrote:

I need to download the entire UniProt dataset as text (.txt) files. Does anybody know whether it is possible to download the whole dataset in txt format, but without all of the information about each protein? I only need the ID, name, function, sequence, structure and resolution. For a single UniProt ID, it would look something like this. Thanks in advance.

ID   AQP1_HUMAN              Reviewed;         269 AA.
AC   P29972; B5BU39; E7EM69; E9PC21; F5GY19; Q8TBI5; Q8TDC1;
DE   RecName: Full=Aquaporin-1;
DE            Short=AQP-1;
DE   AltName: Full=Aquaporin-CHIP;
DE   AltName: Full=Urine water channel;
DE   AltName: Full=Water channel protein for red blood cells and kidney proximal tubule;
DR   PDB; 1FQY; X-ray; 3.80 A; A=1-269.
DR   PDB; 1H6I; X-ray; 3.54 A; A=1-269.
DR   PDB; 1IH5; X-ray; 3.70 A; A=1-269.
DR   PDB; 4CSK; X-ray; 3.28 A; A=1-269.
SQ   SEQUENCE   269 AA;  28526 MW;  BA204D82FB26352E CRC64;
     MASEFKKKLF WRAVVAEFLA TTLFVFISIG SALGFKYPVG NNQTAVQDNV KVSLAFGLSI
     ATLAQSVGHI SGAHLNPAVT LGLLLSCQIS IFRALMYIIA QCVGAIVATA ILSGITSSLT
     GNSLGRNDLA DGVNSGQGLG IEIIGTLQLV LCVLATTDRR RRDLGGSAPL AIGLSVALGH
     LLAIDYTGCG INPARSFGSA VITHNFSNHW IFWVGPFIGG ALAVLIYDFI LAPRSSDLTD
     RVKVWTSGQV EEYDLDADDI NSRVEMKPK
23 months ago, genomax (United States) wrote:

Take a look at the README file on the UniProt FTP site, then choose the file you need.


Oh, I see. Thanks, man!

Reply written 23 months ago by srdjanmasirevic2
23 months ago, Elisabeth Gasteiger (Geneva) wrote:

You could download the complete flat file (either from the UniProt FTP site, or from your query result page on the website) and then use grep or some scripting language to keep only these line types. If you need assistance with regular expressions to obtain exactly this data, please don't hesitate to contact the UniProt helpdesk. The flat file format is documented at http://www.uniprot.org/docs/userman.htm
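As a rough illustration of the "scripting language" route, here is a minimal Python sketch that keeps only the line types shown in the question (ID, AC, DE, DR PDB, SQ and the sequence lines), plus the FUNCTION comment block. The function names are made up for this example. It assumes the usual flat-file layout: a two-letter line code followed by three spaces, sequence lines starting with five spaces, FUNCTION comments starting with `CC   -!- FUNCTION:` and continuing on lines that start with `CC` and seven spaces, and `//` ending each entry. Check these against the user manual before relying on it.

```python
import gzip

# Prefixes of the lines to keep; this mirrors the excerpt in the question.
# "     " (five spaces) matches sequence data lines, "//" ends an entry.
KEEP_PREFIXES = ("ID   ", "AC   ", "DE   ", "DR   PDB;", "SQ   ", "     ", "//")

FUNCTION_START = "CC   -!- FUNCTION:"
CC_CONTINUATION = "CC       "  # assumed layout of a continued comment line


def filter_flat_file(lines):
    """Yield only the wanted lines from an iterable of flat-file lines."""
    in_function = False  # True while inside a CC FUNCTION block
    for line in lines:
        if line.startswith(FUNCTION_START):
            in_function = True
            yield line
        elif in_function and line.startswith(CC_CONTINUATION):
            yield line
        else:
            in_function = False
            if line.startswith(KEEP_PREFIXES):
                yield line


def filter_file(src_path, dst_path):
    """Filter a flat file on disk; a .gz source is read transparently."""
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
        dst.writelines(filter_flat_file(src))
```

Calling `filter_file` with the path of the downloaded flat file and an output path (both of your choosing) streams the file line by line, so the full dataset never has to fit in memory.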

UniProt also offers a tab-delimited download from the website (http://www.uniprot.org/help/customize, http://insideuniprot.blogspot.ch/2015_03_01_archive.html).

This would work well in your case: you would keep the columns for identifiers, protein names, PDB cross-references and sequence. Unfortunately, the tab-separated download has a limitation for cross-references. The HTML version of the result table contains the full cross-reference information, including the PDB method and resolution, but the tab-separated download contains only the PDB identifier and omits the other information on these lines.

We are looking into changing this, although there are some issues (separators, line length as there are entries with more than 500 PDB cross-references, etc).
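Until then, the method and resolution can be recovered from the `DR   PDB;` lines of the flat file. The sketch below splits one such line into its fields, assuming the layout seen in the question's excerpt (`DR   PDB; <id>; <method>; <resolution> A; <chains>.`). The function name `parse_pdb_xref` and the returned dictionary keys are made up for this example, and a resolution that is not a number followed by ` A` (for instance a placeholder for methods without a resolution) is returned as `None`.

```python
def parse_pdb_xref(line):
    """Parse one 'DR   PDB;' flat-file line into a dict, or return None."""
    if not line.startswith("DR   PDB;"):
        return None
    # Drop the line code, the trailing newline and the final period,
    # then split the remainder on the ';' field separator.
    body = line[5:].rstrip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 5:
        return None  # unexpected layout, leave it to the caller
    _, pdb_id, method, resolution, chains = fields[:5]

    value = None
    if resolution.endswith(" A"):
        try:
            value = float(resolution[:-2])  # resolution in angstroms
        except ValueError:
            value = None
    return {
        "pdb_id": pdb_id,
        "method": method,
        "resolution": value,
        "chains": chains,
    }


example = "DR   PDB; 1FQY; X-ray; 3.80 A; A=1-269."
print(parse_pdb_xref(example))
```

Applied to every `DR   PDB;` line of an entry, this gives one record per structure, which sidesteps the separator and line-length questions by keeping the cross-references as a list rather than a single tab-separated cell.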
