Retrieving Fasta Sequence Using Uniprot Id
7.9 years ago
bioinfo ▴ 790

Is there any way to retrieve fasta sequences using UniProt IDs (e.g. O53166) from the command line? I usually use blastdbcmd or fastacmd to grab fasta sequences by gi, but I am not sure whether fastacmd or blastdbcmd also work for UniProt IDs. The fastacmd documentation says gi's, accessions and loci can be used in the argument (e.g. fastacmd -d database -s gi/accession/locus). Can UniProt IDs be used here as accessions? Alternatively, I could map the UniProt IDs to gi's using the ID mapping, which would let me use fastacmd/blastdbcmd, but I realised a single UniProt ID can map to 5-6 different gi's.

fasta • 11k views
7.9 years ago
Hamish ★ 3.2k

The use of 'fastacmd' and 'blastdbcmd' suggests you are trying to get the UniProtKB sequences from an NCBI BLAST database. Depending on how the database was constructed look-ups using the various identifiers may or may not work.

Firstly, the NCBI BLAST database needs to have been built with indexing of the sequence identifiers enabled (i.e. with '-oT' for 'formatdb' or '-parse_seqids' for 'makeblastdb'). The BLAST databases provided on the NCBI's FTP site should all have this enabled, but for other NCBI BLAST databases this may not have been enabled when the database was created.
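
As a rough illustration of the indexing step, the sketch below builds a tiny protein database with '-parse_seqids' and then looks an entry up by identifier. The file names, the identifier 'demo1' and the sequence are all hypothetical placeholders, and the BLAST+ commands only run if 'makeblastdb' is on the PATH:

```shell
# Hypothetical demo input: 'demo1' and its sequence are placeholders, not real data.
workdir=$(mktemp -d)
printf '>demo1 placeholder protein\nMKTAYIAKQRQISFVKSHFSRQ\n' > "$workdir/demo.fasta"

if command -v makeblastdb >/dev/null 2>&1; then
    # -parse_seqids enables identifier indexing, so 'blastdbcmd -entry' can find 'demo1'.
    makeblastdb -in "$workdir/demo.fasta" -dbtype prot -parse_seqids -out "$workdir/demo" \
        && blastdbcmd -db "$workdir/demo" -dbtype prot -entry 'demo1' \
        || echo "database build or look-up failed" >&2
else
    echo "makeblastdb not found; skipping database build" >&2
fi
```

The temporary directory can be removed afterwards with 'rm -rf "$workdir"'. Building the same database without '-parse_seqids' would be expected to make the identifier look-up fail, which is a quick way to check how an existing database was built.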

For the 'nr' BLAST database provided by NCBI look-ups are supported using all the entry identifiers appearing in the fasta header line. So for UniProtKB:WAP_RAT the 'nr' fasta header line is:

>gi|139691|sp|P01174.2|WAP_RAT RecName: Full=Whey acidic protein; Short=WAP; AltName: Full=Whey phosphoprotein; Flags: Precursor >gi|5679681|emb|CAA25600.2| whey acidic protein [Rattus norvegicus]

Which means we can search 'nr' with:

  1. NCBI gi number:

    blastdbcmd -db nr -dbtype prot -entry '139691' -get_dups
    blastdbcmd -db nr -dbtype prot -entry '5679681' -get_dups
    
  2. UniProtKB accession:

    blastdbcmd -db nr -dbtype prot -entry 'P01174' -get_dups
    
  3. UniProtKB sequence version accession:

    blastdbcmd -db nr -dbtype prot -entry 'P01174.2' -get_dups
    
  4. UniProtKB entry name aka. UniProtKB ID:

    blastdbcmd -db nr -dbtype prot -entry 'WAP_RAT' -get_dups
    
  5. INSDC protein_id:

    blastdbcmd -db nr -dbtype prot -entry 'CAA25600' -get_dups
    

For BLAST databases which were built from fasta format data which used an alternative header format, for example a 'uniprotkb' BLAST database generated from the UniProtKB fasta files provided by EMBL-EBI (ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/) which use the fasta header format:

>SP:WAP_RAT P01174 Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2

The support for parsing the identifier in NCBI BLAST can be insufficient. In which case the entries can only be retrieved by using the generic fasta identifier (i.e. first "word" on the header line):

blastdbcmd -db uniprotkb -dbtype prot -entry 'SP:WAP_RAT' -get_dups
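
To check which generic identifiers such a database is likely to accept, one option is to list the first "word" of each header in the fasta file it was built from. This is only a sketch; the helper name 'first_words' is made up for illustration, and the header is the one quoted above:

```shell
# Print the first "word" of every fasta header line, without the leading '>'.
# Reads the files given as arguments, or standard input if none are given.
first_words() {
    awk '/^>/ { print substr($1, 2) }' "$@"
}

printf '%s\n' '>SP:WAP_RAT P01174 Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2' | first_words
```

For the header above this prints SP:WAP_RAT, which is the string passed to '-entry' in the command above.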

The 'fastacmd' program works in exactly the same way, but the command-line syntax is a little different. For example, fetching the sequence from above using the UniProtKB sequence version accession uses the command-line:

fastacmd -d nr -pT -s 'P01174.2' -aT

Note: 'fastacmd' and 'blastdbcmd' support batch retrieval using a comma separated list of identifiers, so when fetching many entries you may want to batch them for efficiency reasons. The queries above use the '-get_dups' or '-aT' to allow for cases where an identifier may correspond to multiple sequences (shouldn't happen in these databases, but you never know).
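
A minimal sketch of the batching idea, assuming the identifiers arrive one per line: join them into a comma-separated list and pass the list to a single 'blastdbcmd' call. The helper name 'join_ids' is hypothetical, and the look-up only runs if 'blastdbcmd' is installed (it also needs a local 'nr' database):

```shell
# Join identifiers (one per line on standard input) into a comma-separated list.
join_ids() {
    paste -sd, -
}

ids=$(printf '%s\n' P01174 WAP_RAT CAA25600 | join_ids)
echo "$ids"

if command -v blastdbcmd >/dev/null 2>&1; then
    # One call for the whole batch instead of one call per identifier.
    blastdbcmd -db nr -dbtype prot -entry "$ids" -get_dups \
        || echo "look-up failed (is the nr database available?)" >&2
fi
```

For long lists, 'blastdbcmd' also has an '-entry_batch' option that reads identifiers from a file, one per line, which avoids very long command lines.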

If you do not have an appropriate NCBI BLAST database for these look-ups, then web based options such as those mentioned in the other answers (e.g. UniProt.org RESTful API, EMBL-EBI dbfetch, NCBI E-utils, etc.) may be more appropriate depending on how much of the database you need. Otherwise you may want to download the data, and appropriate indexing software (e.g. NCBI BLAST, EMBOSS, BioPerl, etc.) in order to perform the look-ups locally.


That's impressive. Good to see it explained in detail. Very useful answer, I really appreciate it.


Great answer. Is there a built-in way to limit the search to only the initial gi? e.g. in your example above, retrieve the FASTA entry via 

blastdbcmd -db nr -dbtype prot -entry '139691' -get_dups

but not by:

blastdbcmd -db nr -dbtype prot -entry '5679681' -get_dups
7.9 years ago
sarahhunter ▴ 600

Uniprot.org uses RESTful URLs so you can use wget to retrieve information like so:

wget -nv http://www.uniprot.org/uniprot/O53166.fasta
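
For more than one accession, the same URL pattern can be generated in a loop. This is only a sketch: the helper name 'uniprot_fasta_url' is made up, and it assumes the URL layout shown above is still served (the site may redirect or have changed it since):

```shell
# Build the fasta URL for one accession, following the URL pattern shown above.
uniprot_fasta_url() {
    printf 'http://www.uniprot.org/uniprot/%s.fasta\n' "$1"
}

# Print one URL per accession; nothing is downloaded here.
for acc in O53166 P01174; do
    uniprot_fasta_url "$acc"
done
```

Piping the printed URLs into 'wget -nv -i -' would then download each file in turn.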

I would never use gi numbers over accessions - they aren't stable!


For a complete description of the UniProt.org RESTful API see: http://www.uniprot.org/faq/28. The API includes support for batch retrieval of entries.

UniProt recommend the use of accession numbers wherever possible. UniProtKB names (e.g. WAP_RAT), often called UniProtKB IDs since they appear on the 'ID' line of the flatfile format, are subject to change, whereas the accession (e.g. P01174) is stable and is maintained through changes to the entry, including merges and splits.

7.9 years ago

I'd use dbfetch from the EBI to get the fasta sequences if I had UniProt IDs.

http://www.ebi.ac.uk/Tools/dbfetch/

Similar to the previous suggestion of just calling the url, but you can request up to 200 ids at a time.
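
A rough sketch of how a long list could be split to respect the 200-id limit. The helper names are hypothetical, and the query-string layout (db, id, format, style) is my assumption about the dbfetch URL syntax, so check it against the dbfetch documentation before relying on it:

```shell
# Group identifiers (one per line on standard input) into comma-separated
# batches; the batch size defaults to 200.
batch_ids() {
    xargs -n "${1:-200}" | tr ' ' ','
}

# ASSUMPTION: this parameter layout is a guess at the dbfetch URL syntax.
dbfetch_url() {
    printf 'http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=%s&format=fasta&style=raw\n' "$1"
}

# Demo with a batch size of 2 so the splitting is visible.
printf '%s\n' P01174 O53166 Q6GZX4 | batch_ids 2 | while read -r batch; do
    dbfetch_url "$batch"
done
```

With the default batch size, each printed URL carries at most 200 identifiers and can be fetched with wget as in the earlier answer.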


For details of the various URL formats supported by dbfetch see: http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp

For sample client code using dbfetch see: http://www.ebi.ac.uk/Tools/webservices/services/dbfetch_rest. The sample clients provide some additional support for the meta-data provided by dbfetch. Dbfetch is also supported by various other tools including EMBOSS, BioJava, BioPerl and BioRuby.

5.4 years ago
Kurban ▴ 200

Hey guys,

I see this thread is a little old, but I want to ask a question about the UniProt fasta file header.

In the headers shown below:

kurban@kurban-X550VC:~/Desktop/Uniprot$ zcat uniprot_sprot.fasta.gz | more
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS
EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD
AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL
EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD
SFRKIYTDLGWKFTPL

kurban@kurban-X550VC:~/Desktop/Uniprot$ zcat uniprot_trembl.fasta.gz | more
>tr|G9CT51|G9CT51_9ARCH Ammonia monooxygenase (Fragment) OS=uncultured ammonia-oxidizing archaeon GN=amoA PE=4 SV=1
CTHYLFIVVVAVNSTLLTINAGDYIFYTDWAWTSFTVFSISQTLMLIVGACYYLTFTGVP
GTATYYALIMTVYTWVAKAAWFSLGYPYDFIVTPVWLPSAMLLDLVYWATKKNKHSLILF
GGVLVGMSLPLFNMVNLITVADPLETAFKYPRPTLPPYMTPIEPQVGKFYNSPVALGAGA
GAVLGCTFAALGCKLNT

Which fields are UniProt IDs and which are accession numbers?


To get answers, please post this as a new, separate question.
