Question: How to retrieve sets of protein sequences?
0
5 months ago by
Learner 160
Learner 160 wrote:

I have a set of protein accession numbers. I want to retrieve their sequences programmatically. Is there any way to do that from UniProt? There is a page describing this on UniProt itself, but to be honest I could not understand it.

Any comment would be appreciated. Thanks.

gene • 316 views
ADD COMMENTlink modified 5 months ago by vkkodali1.1k • written 5 months ago by Learner 160

There is a page describing this on UniProt itself, but to be honest I could not understand it.

Please link to the page and specify exactly what you do not understand.

ADD REPLYlink written 5 months ago by RamRS21k
2
5 months ago by
vkkodali1.1k
United States
vkkodali1.1k wrote:

You can get it from UniProt directly using curl as follows:

$ cat uniprot_ids.txt 
P00750
P00751
P00752

$ for acc in `cat uniprot_ids.txt` ; do curl -s "https://www.uniprot.org/uniprot/$acc.fasta" ; done > uniprot_seqs.fasta

But if you choose to go with Entrez Direct, then I suggest the following command:

$ cat uniprot_ids.txt | epost -db protein | efetch -db protein -format fasta > uniprot_seqs.fasta
ADD COMMENTlink written 5 months ago by vkkodali1.1k

@vkkodali How can I know which protein sequences were downloaded and which were not? Also, should I install epost and efetch? It gives me an error.

ADD REPLYlink written 5 months ago by Learner 160

How can I know which protein sequences were downloaded and which were not?

You will have to use a unix tool like grep to check which IDs are in the fasta file and which ones are missing.
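Another option is to record failures while downloading, so there is nothing to reconstruct afterwards. The following is only a sketch: the function name `fetch_uniprot` and the file `failed_ids.txt` are made up for illustration, and it assumes the URL pattern from the answer above still resolves. The `-f` flag makes curl return a non-zero exit status on an HTTP error instead of saving the error page, and `-L` follows redirects.

```shell
# Sketch: download each accession separately and record the ones that fail.
# fetch_uniprot and failed_ids.txt are hypothetical names, not from the thread.
fetch_uniprot() {
    ids_file=$1; out_fasta=$2; failed_file=$3
    : > "$out_fasta"      # start with empty output files
    : > "$failed_file"
    # tr strips Windows carriage returns so they do not end up in the URL
    tr -d '\r' < "$ids_file" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue      # skip blank lines
        if ! curl -sfL "https://www.uniprot.org/uniprot/${acc}.fasta" >> "$out_fasta"; then
            echo "$acc" >> "$failed_file"
        fi
    done
}

# Usage (contacts the UniProt server):
# fetch_uniprot uniprot_ids.txt uniprot_seqs.fasta failed_ids.txt
```

After the run, `failed_ids.txt` lists every accession for which curl reported an error; an empty file means every request succeeded.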

Should I install epost and efetch?

If you have followed the instructions at http://bit.ly/entrez-direct then you should have access to all 9 of the Entrez Direct tools. Make sure the edirect tools are in your PATH (this is what the last two export commands in the installation instructions do). You will have to log out of the terminal and log back in for that to take effect.
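A quick way to confirm that the tools are actually visible to your shell is to ask it where each one lives. This is a small sketch; `check_tools` is a hypothetical helper name, and the three tools listed in the usage line are simply the ones mentioned in this thread plus esearch.

```shell
# Sketch: report whether each named command can be found on PATH.
# check_tools is a hypothetical helper name.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: NOT found on PATH"
        fi
    done
}

# Usage:
# check_tools esearch epost efetch
```

If any tool is reported as not found, the PATH export lines have probably not taken effect in the current session.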

Finally, if you see errors related to API keys, see https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/ to create your own API key and use it with the command line tools by setting an environment variable NCBI_API_KEY as follows at the bash prompt:

export NCBI_API_KEY=12345
ADD REPLYlink written 5 months ago by vkkodali1.1k

@vkkodali Actually I am using UniProt, which is what I am interested in. However, it is killing me. For example, check these accessions: P21333 O43707 P68363

You can see that they exist in UniProt, but when I use your approach I get nothing. Do you think the format of the .txt file could have an influence?

Also, is it possible to check the whole list against the FASTA file with grep?

ADD REPLYlink written 5 months ago by Learner 160

I am able to download the FASTA sequences for all three of those IDs without an issue. How did you create the uniprot_ids.txt file? If you have created it on Windows then you may want to run dos2unix on that file to make sure it's not the line endings that are mucking things up.
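If dos2unix is not installed, you can check for Windows line endings and strip them with standard tools. A minimal sketch follows; `has_cr` and `strip_cr` are hypothetical helper names, and the output file name in the usage lines is only an example.

```shell
# Sketch: detect and remove carriage returns (the "\r" in Windows "\r\n").
# has_cr and strip_cr are hypothetical helper names.
has_cr() {
    # exit status 0 if the file contains at least one carriage return
    grep -q "$(printf '\r')" "$1"
}

strip_cr() {
    # write a copy of file $1 to $2 with every carriage return removed
    tr -d '\r' < "$1" > "$2"
}

# Usage:
# has_cr uniprot_ids.txt && echo "Windows line endings detected"
# strip_cr uniprot_ids.txt uniprot_ids.unix.txt
```

A stray carriage return becomes part of each accession in the URL, which would explain getting nothing back for IDs that clearly exist.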

To check whether all of the IDs from uniprot_ids.txt are present in the uniprot_seqs.fasta file, you can run the following command:

comm -23 <(sort uniprot_ids.txt ) <(grep '^>' uniprot_seqs.fasta | cut -f2 -d '|' | sort )

All accessions that are present in your original list and not in the FASTA file will be returned.
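To see what that comparison does, here is the same logic spelled out with temporary files instead of process substitution, so it also runs in a plain POSIX shell. Treat it as a sketch: `missing_ids` is a hypothetical function name. `comm -23` suppresses column 2 (lines only in the second file) and column 3 (lines in both), leaving the lines found only in the first file.

```shell
# Sketch: print accessions listed in an ID file but absent from a FASTA file.
# missing_ids is a hypothetical helper name.
missing_ids() {
    ids_sorted=$(mktemp)
    found_sorted=$(mktemp)
    # comm needs both inputs sorted the same way
    sort -u "$1" > "$ids_sorted"
    # the accession is the second "|"-separated field of a header like ">sp|P00750|..."
    grep '^>' "$2" | cut -d '|' -f 2 | sort -u > "$found_sorted"
    comm -23 "$ids_sorted" "$found_sorted"
    rm -f "$ids_sorted" "$found_sorted"
}

# Usage:
# missing_ids uniprot_ids.txt uniprot_seqs.fasta
```

Note that this relies on UniProt-style headers where the accession sits between the first two `|` characters; FASTA files from other sources may need a different cut.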

ADD REPLYlink written 5 months ago by vkkodali1.1k

@vkkodali I found where the issue was. Your solution is the best solution.

I think if I want to extract the IDs from the FASTA file, I can simply do the following, right?

grep '^>' uniprot_seqs.fasta | cut -f2 -d '|' | sort
ADD REPLYlink modified 5 months ago • written 5 months ago by Learner 160

If you just want to extract the IDs and do not care about their order, you can skip the sort at the end. The IDs will then be returned in the same order they appear in the FASTA file.
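For what it is worth, the grep and cut steps can also be combined into a single awk call. This is just an equivalent sketch under the same assumption about UniProt-style headers; `extract_ids` is a hypothetical function name.

```shell
# Sketch: print the accession from each FASTA header, in file order.
# extract_ids is a hypothetical helper name.
extract_ids() {
    # split on "|" and print field 2 for header lines only
    awk -F'|' '/^>/ { print $2 }' "$1"
}

# Usage:
# extract_ids uniprot_seqs.fasta            # order as in the FASTA file
# extract_ids uniprot_seqs.fasta | sort     # sorted, e.g. as input for comm
```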

ADD REPLYlink written 5 months ago by vkkodali1.1k
0
5 months ago by
genomax67k
United States
genomax67k wrote:

The simplest solution may be to use NCBI's Unix utilities. Pass in your IDs (the examples below are UniProt IDs) one at a time or as a batch, as follows.

$ efetch -db protein -id "P00750,P00751,P00752" -format fasta
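If the IDs are in a file with one accession per line, the comma-separated list for `-id` can be built on the fly. A small sketch, where `ids_csv` is a hypothetical helper name and uniprot_ids.txt is the file used elsewhere in this thread:

```shell
# Sketch: join one-per-line accessions into a single comma-separated string.
# ids_csv is a hypothetical helper name.
ids_csv() {
    # drop carriage returns and blank lines, then join all lines with commas
    tr -d '\r' < "$1" | grep -v '^$' | paste -s -d, -
}

# Usage (requires Entrez Direct to be installed):
# efetch -db protein -id "$(ids_csv uniprot_ids.txt)" -format fasta
```

For long lists, the piped epost/efetch form shown in the other answer is probably the safer choice, since a very long `-id` argument may run into length limits.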
ADD COMMENTlink modified 5 months ago • written 5 months ago by genomax67k
0
5 months ago by
piyushjo110
piyushjo110 wrote:

Ensembl BioMart. It takes any sort of ID: RefSeq, HGNC, or just a gene name. http://useast.ensembl.org/biomart/martview/bfc7092adedc70231fd4027a5b8eaaed

1) Choose data set (Ensembl v94)

2) Filters: Gene name, ensemble id, refseq id or anything

3) Attributes: In features choose "gene id" or "gene name" and then in sequences choose "peptide"

4) hit results

ADD COMMENTlink modified 5 months ago • written 5 months ago by piyushjo110

It's Ensembl, there's no e at the end.

ADD REPLYlink written 5 months ago by RamRS21k