Question: Parsing uniprot .dat files
4.3 years ago by
Biogeek400 wrote:

I just downloaded the .dat.gz files and gunzipped them. I am now wondering how I can obtain FASTA sequences for all the entries within and then, in a separate file, all of the useful info such as associated GO terms, gene names, IPR terms, etc.

How do people normally do this?


parsing uniprot annotation • 2.3k views
modified 4.2 years ago by Elisabeth Gasteiger1.8k • written 4.3 years ago by Biogeek400

Check the README file and then download the data from the correct folders here.

modified 4.3 years ago • written 4.3 years ago by GenoMax92k

Yes, I already read it; however, the .dat files are flat files. I don't really have much of a clue how to split them into a .fasta file for the sequences and another tab-delimited file for the associated info.

written 4.3 years ago by Biogeek400

That is the point. I am not sure why you got the .dat files when the files you want are in a different directory:

1) Directory /current_release/knowledgebase

subdirectory /complete: This directory contains the four-weekly updates of the UniProt Knowledgebase, consisting of UniProtKB/Swiss-Prot (fully annotated curated entries) and UniProtKB/TrEMBL (computer-generated entries enriched with automated classification and annotation). Both, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, are available separately in flat file, XML and FASTA format.

written 4.3 years ago by GenoMax92k

These are the .fasta files for the complete DB. I am just after plants. Are you familiar with UniProt? If so, is there a difference between downloading the files from the FTP server and running a query on the website, i.e. searching for the viridiplantae taxonomy and downloading all of the results?

I wonder if there are differences between the two.

written 4.3 years ago by Biogeek400

The difference between the two is just as you describe it. You would need to do additional work to parse the things you need from the complete DB, whereas a query on the site does that for you.

A search via the web only allows you to select 400 entries at a time, so unless you have a ton of patience, your only option is to get the full database and parse the data yourself.
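A minimal sketch of that kind of parsing, using only the Python standard library and assuming the standard UniProtKB flat-file layout (two-letter line code, data starting at column 6, entries terminated by `//`). The function names are made up for illustration, and only the `Name=` gene token plus the GO and InterPro cross-references are extracted; a dedicated parser such as Biopython's `Bio.SwissProt` module is likely to be more robust.

```python
def read_entries(lines):
    """Yield each flat-file entry as a list of lines (the '//' terminator is dropped)."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def parse_entry(entry):
    """Pull the entry name, primary accession, gene names, GO/InterPro IDs and sequence."""
    rec = {"id": "", "acc": "", "genes": [], "go": [], "interpro": [], "seq": ""}
    seq_parts = []
    in_seq = False
    for line in entry:
        code, data = line[:2], line[5:]
        if in_seq:
            # Sequence lines follow the SQ header: residues in space-separated blocks.
            seq_parts.append(data.replace(" ", ""))
        elif code == "ID":
            rec["id"] = data.split()[0]
        elif code == "AC" and not rec["acc"]:
            # The first accession on the first AC line is the primary one.
            rec["acc"] = data.split(";")[0].strip()
        elif code == "GN":
            for field in data.split(";"):
                field = field.strip()
                if field.startswith("Name="):
                    # Drop any trailing evidence tag such as {ECO:...}.
                    rec["genes"].append(field[5:].split(" {")[0])
        elif code == "DR":
            parts = [p.strip() for p in data.rstrip(".").split(";")]
            if len(parts) > 1 and parts[0] == "GO":
                rec["go"].append(parts[1])
            elif len(parts) > 1 and parts[0] == "InterPro":
                rec["interpro"].append(parts[1])
        elif code == "SQ":
            in_seq = True
    rec["seq"] = "".join(seq_parts)
    return rec


def to_fasta(rec, width=60):
    """Format one record as FASTA, wrapping the sequence at `width` residues."""
    seq = rec["seq"]
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{rec['acc']} {rec['id']}\n{body}\n"


def to_row(rec):
    """Format one record as a tab-delimited annotation row."""
    return "\t".join([rec["acc"], rec["id"], ";".join(rec["genes"]),
                      ";".join(rec["go"]), ";".join(rec["interpro"])])


def convert(lines, fasta_out, table_out):
    """Stream entries from `lines` into a FASTA file and a TSV file; return the count."""
    table_out.write("accession\tentry_name\tgenes\tgo_terms\tinterpro\n")
    count = 0
    for entry in read_entries(lines):
        rec = parse_entry(entry)
        fasta_out.write(to_fasta(rec))
        table_out.write(to_row(rec) + "\n")
        count += 1
    return count
```

Passing a handle from `gzip.open(path, "rt")` as `lines` should stream the compressed file directly, so the gunzip step becomes optional.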

modified 4.3 years ago • written 4.3 years ago by GenoMax92k

I was able to run a query and download the 3 million-odd viridiplantae sequences at once. Handy.

HOWEVER... when I compared the query search sequences to the actual viridiplantae taxonomic division flat files, extra taxa were included, such as plant-associated pathogen taxa, rhodophyta, etc.

So both are different.
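A comparison like this can be sketched with plain set arithmetic on accessions. The sketch assumes the website download is FASTA with UniProt-style headers (`>db|ACCESSION|ENTRY_NAME ...`) and the FTP download is a flat file with `AC` lines; the function names are hypothetical, and this is not necessarily how the comparison above was done.

```python
def fasta_accessions(lines):
    """Collect accessions from headers of the form '>db|ACCESSION|ENTRY_NAME ...'."""
    accs = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split("|")
            if len(fields) >= 3:
                accs.add(fields[1])
    return accs


def flatfile_accessions(lines):
    """Collect the primary accession (first on the first AC line) of each entry."""
    accs = set()
    need_primary = True
    for line in lines:
        if line.startswith("AC   ") and need_primary:
            accs.add(line[5:].split(";")[0].strip())
            need_primary = False
        elif line.startswith("//"):
            # End of entry: the next AC line belongs to a new entry.
            need_primary = True
    return accs


def compare(query_accs, ftp_accs):
    """Report which accessions are unique to each source and which are shared."""
    return {
        "only_in_query": query_accs - ftp_accs,
        "only_in_ftp": ftp_accs - query_accs,
        "shared": query_accs & ftp_accs,
    }
```

The accessions found in only one source can then be looked up to see which taxa account for the difference.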

written 4.3 years ago by Biogeek400

To be clear, more sequences are available through the flat files under the FTP taxonomic divisions.

written 4.3 years ago by Biogeek400
4.2 years ago by
Elisabeth Gasteiger1.8k wrote:

The difference between the .dat file downloaded from the FTP server and the query taxonomy:viridiplantae on the website is the following:

The file on the FTP server also includes various taxonomic nodes for organisms that undergo photosynthesis, in particular:


Thanks for pointing out this discrepancy. We will try to improve consistency / documentation regarding this issue.

written 4.2 years ago by Elisabeth Gasteiger1.8k

Hi Elisabeth, I agree. I managed to parse the files from the FTP server using a combination of Swissknife, Biopython and grep/sed. I ended up going for the full UniProt database download :-) Much better annotation, and it helped me to weed out contaminants.
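For anyone wanting to do the taxonomy filtering without Swissknife or Biopython, here is a rough standard-library sketch. It is not the pipeline described above; it assumes the lineage sits on `OC` lines as a semicolon-separated list ending in a period, and the function names and the `depth` parameter are made up for illustration.

```python
from collections import Counter


def read_entries(lines):
    """Yield each flat-file entry as a list of lines (the '//' terminator is dropped)."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def lineage(entry):
    """Join the OC lines of one entry and split them into a list of taxa."""
    text = " ".join(line[5:].strip() for line in entry if line.startswith("OC   "))
    return [taxon.strip() for taxon in text.rstrip(".").split(";") if taxon.strip()]


def filter_clade(lines, clade, excluded, depth=2):
    """Yield entries whose lineage contains `clade`.

    Entries outside the clade are tallied in the `excluded` Counter, keyed by
    the first `depth` levels of their lineage, so the extra taxa can be reviewed.
    """
    for entry in read_entries(lines):
        taxa = lineage(entry)
        if clade in taxa:
            yield entry
        else:
            excluded["; ".join(taxa[:depth])] += 1
```

With `excluded = Counter()` and `clade="Viridiplantae"`, the counter ends up listing the non-plant lineages (rhodophyta, pathogens and so on) that were filtered out.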

written 4.2 years ago by Biogeek400