Question

Short Codes Annotation Line Fasta Formatted Files From Swissprot

10.7 years ago

Arjen Ten Have ▴ 360

I want to make a database of HMMER profiles from the Swissprot sequences. Each Swissprot entry comes with a short code that is part of the annotation line, and I would like to cluster sequences based on these codes. Is there a straightforward way to obtain the sequences in separate, cluster-specific files? I could write a script, but this is hampered by the fact that the annotation format is not so straightforward.

annotation • 2.8k views

ADD COMMENT • link updated 10.7 years ago by Hamish ★ 3.2k • written 10.7 years ago by Arjen Ten Have ▴ 360

Why is the annotation format not straightforward?

Also, clustering based on the accession number seems a bit odd. To my knowledge, accession numbers are assigned somewhat randomly, depending on when the entries were added to the database. I might be wrong, but I have never seen a claim to the contrary, nor could I find any documentation describing this.

(By Swissprot I assume you're referring to the Uniprot database)

ADD REPLY • link 10.7 years ago by David Westergaard ★ 1.5k

Yes, but only the Swissprot part, not the TrEMBL part.

ADD REPLY • link 10.7 years ago by Arjen Ten Have ▴ 360

A bit more explanation. The Swissprot annotation has an accession number, a short code, and a longer code:

gi|1345643|sp|P48419.1|C75A3_PETHY RecName: Full=Flavonoid 3',5'-hydroxylase 2; Short=F3'5'H; AltName: Full=CYPLXXVA3; AltName: Full=Cytochrome P450 75A3

In this example the short code is C75A3, which is directly related to the protein's biochemical function and therefore provides a fairly good character for a preliminary classification. The problem is that the amount of data before the short code differs between entries; otherwise I would simply paste the FASTA file into a spreadsheet, using the pipe as a separator, then sort, copy and download the groups one by one (still work, but feasible).

ADD REPLY • link 10.7 years ago by Arjen Ten Have ▴ 360

I see. By short code I thought you were referring to the non-human-readable accession number.

I would just match that with a regex. As I recall, the "C75A3_PETHY" part of the annotation always follows the last pipe.

In Python it would be something along the lines of:

>>> import re
>>> x = "gi|1345643|sp|P48419.1|C75A3_PETHY RecName: Full=Flavonoid 3',5'-hydroxylase 2; Short=F3'5'H; AltName: Full=CYPLXXVA3; AltName: Full=Cytochrome P450 75A3"
>>> re.findall(r"\|(\w+)_(\w+)\s+", x)
[('C75A3', 'PETHY')]
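
To get from a single header to cluster-specific files, the "last pipe" idea could be extended roughly as follows. This is only a sketch under assumptions: the function names (`short_code`, `group_by_short_code`, `write_groups`) and the `<code>.fasta` file naming are made up for illustration, it uses plain string splitting rather than the regex, and it keeps all records in memory, which may not suit the full Swissprot set.

```python
from collections import defaultdict
from pathlib import Path


def short_code(header):
    """Return the short code from a FASTA header line, or None if absent.

    Takes the first whitespace-delimited token, keeps what follows the
    last pipe (e.g. "C75A3_PETHY") and returns the part before the
    first underscore ("C75A3").
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        return None
    entry_name = tokens[0].rsplit("|", 1)[-1]
    code, sep, _species = entry_name.partition("_")
    return code if sep and code else None


def group_by_short_code(lines):
    """Group FASTA records (header plus sequence lines) by short code."""
    groups = defaultdict(list)
    current = None  # record list currently being filled, or None to skip
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            code = short_code(line)
            current = groups[code] if code is not None else None
        if current is not None and line:
            current.append(line)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one FASTA file per short code (file naming is an assumption)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for code, records in groups.items():
        (out / f"{code}.fasta").write_text("\n".join(records) + "\n")
```

Usage might look like `write_groups(group_by_short_code(open("sprot.fasta")), "clusters")`, where both names are placeholders. Records whose header has no recognisable short code are silently skipped in this sketch.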

ADD REPLY • link 10.7 years ago by David Westergaard ★ 1.5k

Great clue about the last pipe; I should be able to script this in Perl. Thanks!

ADD REPLY • link 10.7 years ago by Arjen Ten Have ▴ 360

score 0 · Answer 1 · 2013-08-08

The entry name in UniProtKB/Swiss-Prot is composed of two parts, which provide an indicator of the gene symbol and the species. The first part, the gene mnemonic, is not guaranteed to always refer to the same gene, or to be the same for all instances of the gene, so I am not sure why you would want to cluster based on this.

For what it is worth, you might find this easier if you use the fasta sequence format files provided by UniProt (see http://www.uniprot.org/downloads) instead of the NCBI nr version, since these use a cleaner version of the fasta header, which makes it easier to extract the gene symbol using something like:

zcat uniprot_sprot.fasta.gz | perl -ne 'print $1, "\t", $2, "\t", $3, "\n" if(m/^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=([^ ]+)/);'

If you don't actually need the fasta file, you could do this by streaming the data from the UniProt FTP site:

wget -q -O - ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | zcat | perl -ne 'print $1, "\t", $2, "\t", $3, "\n" if(m/^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=([^ ]+)/);'

This gives a three-column tab-delimited table containing the UniProtKB accession, the gene mnemonic from the entry name and the gene symbol, for example:

Q6GZX4    001R    FV3-001R
Q6GZX3    002L    FV3-002L
Q197F8    002R    IIV3-002R
Q197F7    003L    IIV3-003L
Q6GZX2    003R    FV3-003R
Q6GZX1    004R    FV3-004R
Q197F5    005L    IIV3-005L
Q6GZX0    005R    FV3-005R
Q91G88    006L    IIV6-006L
Q6GZW9    006R    FV3-006R
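
For anyone who prefers Python to a Perl one-liner, the same extraction could be sketched like this. The pattern is essentially the one from the commands above, except that the gene symbol is captured with `\S+` so a trailing newline is never included; the names `parse_header` and `iter_gene_table` are illustrative assumptions, not part of any library.

```python
import gzip
import re

# Essentially the pattern from the Perl one-liner: UniProtKB accession,
# gene mnemonic from the entry name, and gene symbol from the GN= field.
HEADER = re.compile(r"^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=(\S+)")


def parse_header(line):
    """Return (accession, mnemonic, gene_symbol), or None if no match."""
    match = HEADER.match(line)
    return match.groups() if match else None


def iter_gene_table(path):
    """Yield one tuple per matching header line in a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                row = parse_header(line)
                if row is not None:
                    yield row
```

Calling `iter_gene_table("uniprot_sprot.fasta.gz")` and joining each tuple with tabs should reproduce the kind of table shown above; as with the Perl version, headers without a `GN=` field are skipped.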

In any case, from your description it sounds like Pfam or UniRef is what you are looking for, since these already incorporate the clustering and are not limited by the peculiarities of the UniProtKB entry names.