Question

Parsing uniprot .dat files

7.7 years ago

Biogeek ▴ 470

I just downloaded the .dat.gz files and gunzipped them. I am now wondering how I can extract all of the sequences within into a .fasta file and then, in a separate file, all of the useful info such as associated GO terms, gene names, IPR terms etc.

How do people normally do this?

Thanks.

Annotation uniprot parsing • 4.2k views

ADD COMMENT • link updated 7.6 years ago by Elisabeth Gasteiger ★ 2.4k • written 7.7 years ago by Biogeek ▴ 470

Check the README file and then download the data from the correct folders here.

ADD REPLY • link 7.7 years ago by GenoMax 141k

Yes, I already read it; however, the .dat files are flat files. I don't really have much of a clue how to split them into a .fasta file for the sequences and another tab-delimited file for the associated info.

ADD REPLY • link 7.7 years ago by Biogeek ▴ 470

That is the point. I am not sure why you got the .dat files when the files you want are in a different directory:

1) Directory /current_release/knowledgebase

subdirectory /complete: This directory contains the four-weekly updates of the UniProt Knowledgebase, consisting of UniProtKB/Swiss-Prot (fully annotated curated entries) and UniProtKB/TrEMBL (computer-generated entries enriched with automated classification and annotation). Both, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, are available separately in flat file, XML and FASTA format.

ADD REPLY • link 7.7 years ago by GenoMax 141k

These are the .fasta files for the complete DB. I am just after plants. Are you familiar with UniProt? If so, is there a difference between downloading the files from the FTP server and running a query on the website, i.e. searching for the Viridiplantae taxonomy and downloading all of the results?

I wonder if there are differences between the two.

ADD REPLY • link 7.7 years ago by Biogeek ▴ 470

The difference between the two is just as you describe it. You would need to do additional work to parse the things you need from the complete DB, whereas a query on the site does that for you.

A search via the web only allows you to select 400 entries at a time, so unless you have a ton of patience your only option is to get the full database and parse the data yourself.

ADD REPLY • link 7.7 years ago by GenoMax 141k

I was able to run a query and download the 3 million-odd Viridiplantae sequences at once, which is handy.

However, when I compared the query results to the actual Viridiplantae taxonomic division flat files, the flat files included extra taxa, such as plant-associated pathogen taxa, Rhodophyta etc.

So the two are different.

ADD REPLY • link 7.7 years ago by Biogeek ▴ 470

To be clear, more sequences are available through the flat files under the FTP taxonomic divisions.

ADD REPLY • link 7.7 years ago by Biogeek ▴ 470

score 1 · Answer 1 · 2016-09-13

7.6 years ago

Elisabeth Gasteiger ★ 2.4k

The difference between the .dat file downloaded from the FTP server and the query taxonomy:viridiplantae on the website is the following:

The file on the FTP server also includes various taxonomic nodes for organisms that undergo photosynthesis, in particular:

Rhodophyta
Cryptophyta
Glaucocystophyceae
Haptophyceae
Stramenopiles
Euglenida
Chlorarachniophyceae
Dinophyceae

Thanks for pointing out this discrepancy. We will try to improve consistency / documentation regarding this issue.
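In the meantime, one way to bring the FTP file in line with the website query is to filter it on the lineage that each entry carries in its OC lines. The following is only a minimal sketch, not an official UniProt tool: the function names are made up for illustration, and it assumes the standard flat-file layout in which every entry ends with a `//` line.

```python
def iter_entries(handle):
    """Yield each flat-file entry as a list of raw lines (ending with '//')."""
    block = []
    for line in handle:
        block.append(line)
        if line.startswith("//"):
            yield block
            block = []


def lineage(block):
    """Collect the taxa named on the OC lines of one entry."""
    text = " ".join(l[5:].strip() for l in block if l.startswith("OC   "))
    return [t.strip() for t in text.rstrip(".").split(";") if t.strip()]


def filter_by_taxon(handle, out, taxon="Viridiplantae"):
    """Copy only entries whose lineage contains `taxon`; return how many were kept."""
    kept = 0
    for block in iter_entries(handle):
        if taxon in lineage(block):
            out.writelines(block)
            kept += 1
    return kept
```

Entries from the nodes listed above (Rhodophyta, Stramenopiles and so on) do not have Viridiplantae in their lineage, so they are dropped by the default setting.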

ADD COMMENT • link 7.6 years ago by Elisabeth Gasteiger ★ 2.4k

Hi Elisabeth, I agree. I managed to parse the files from the FTP server using a combo of swissknife and biopython and grep/sed. I ended up going for the full uniprot database download :-) Much better annotation and helped me to weed out contaminants.
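For anyone landing here with the same question, this is a rough sketch of the kind of parsing involved, written in plain Python rather than the swissknife/biopython route I actually used. It assumes the standard flat-file layout (two-letter line codes, sequence lines after SQ, `//` closing each entry); the function names and the output columns are made up for illustration, and it only pulls out the accession, entry name, gene names, GO IDs and InterPro IDs.

```python
import csv
import re


def new_entry():
    return {"id": "", "accession": "", "genes": [], "go": [],
            "interpro": [], "sequence": []}


def parse_uniprot_dat(handle):
    """Yield one dict per entry from a UniProtKB flat-file (.dat) stream."""
    entry, in_sequence = new_entry(), False
    for raw in handle:
        line = raw.rstrip("\n")
        code, value = line[:2], line[5:]
        if code == "//":                      # end of entry
            entry["sequence"] = "".join(entry["sequence"])
            yield entry
            entry, in_sequence = new_entry(), False
        elif in_sequence:                     # residues, in blocks of 10
            entry["sequence"].append(line.replace(" ", ""))
        elif code == "ID":
            entry["id"] = value.split()[0]
        elif code == "AC" and not entry["accession"]:
            entry["accession"] = value.split(";")[0].strip()  # primary accession
        elif code == "GN":
            names = re.findall(r"\bName=([^;{]+)", value)
            entry["genes"] += [n.strip() for n in names]
        elif code == "DR":
            fields = [f.strip() for f in value.rstrip(".").split(";")]
            if len(fields) > 1 and fields[0] == "GO":
                entry["go"].append(fields[1])
            elif len(fields) > 1 and fields[0] == "InterPro":
                entry["interpro"].append(fields[1])
        elif code == "SQ":
            in_sequence = True


def write_outputs(entries, fasta_out, table_out):
    """Write a FASTA file and a tab-delimited annotation table."""
    writer = csv.writer(table_out, delimiter="\t", lineterminator="\n")
    writer.writerow(["accession", "entry_name", "gene_names", "go_terms", "interpro"])
    for e in entries:
        fasta_out.write(f">{e['accession']} {e['id']}\n")
        for i in range(0, len(e["sequence"]), 60):
            fasta_out.write(e["sequence"][i:i + 60] + "\n")
        writer.writerow([e["accession"], e["id"], ";".join(e["genes"]),
                         ";".join(e["go"]), ";".join(e["interpro"])])
```

To run it on a real download, open the file with `gzip.open(path, "rt")` and pass that handle to `parse_uniprot_dat`, along with two output files opened for writing. Because the parser is a generator, it streams through the file one entry at a time instead of loading the whole thing into memory.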

ADD REPLY • link 7.6 years ago by Biogeek ▴ 470