Question: download protein sequences from NCBI
guillaume.rbt (France) wrote, 2.5 years ago:

Hi all,

I would like to download all protein sequences from one species on NCBI:

https://www.ncbi.nlm.nih.gov/protein?linkname=bioproject_protein&from_uid=261773

This may be trivial, but is there a way to download all the sequences concatenated into a single FASTA file?

Thanks a lot,

Guillaume

protein ncbi fasta • 2.5k views
modified 2.5 years ago by tlorin • written 2.5 years ago by guillaume.rbt

There is a "Send to" option through which you can download all the sequences. Afterwards, if you want them merged into a single FASTA record, remove every header except the first:

awk 'BEGIN{a=0}{if($0~/^>/){if(a==0){print}a++;}else{print}}' input.fasta >out.fasta
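
A minimal sketch of what that awk command does, run on a toy file. The file names `toy.fasta` and `merged.fasta` and the two short sequences are made up for illustration. Note that dropping the later headers fuses all proteins into one long record; if the goal is simply one file holding every protein as a separate record, the downloaded multi-record FASTA can most likely be used as it is.

```shell
# Build a tiny two-record FASTA file (made-up sequences, for illustration only).
printf '>seq1 first protein\nMFVH\nASGP\n>seq2 second protein\nMPQY\n' > toy.fasta

# Count the records before merging: one header line per record.
grep -c '^>' toy.fasta

# Keep only the first header and every sequence line, as in the command above.
awk 'BEGIN{a=0}{if($0~/^>/){if(a==0){print}a++;}else{print}}' toy.fasta > merged.fasta

# Count the records after merging: only the first header is left.
grep -c '^>' merged.fasta
```

Counting header lines with `grep -c '^>'` is also a quick way to check that a download contains as many sequences as the NCBI results page reports.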
modified 2.5 years ago • written 2.5 years ago by Prasad

Thanks for the response. How do you use the "Send to" option? Is it on the console or on the website?

written 2.5 years ago by guillaume.rbt
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 2.5 years ago:

Using the NCBI web interface, you can just click on "Send to > File".

Or, using the E-utilities:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=bioproject&id=261773&linkname=bioproject_protein" |
  xmllint --xpath '//LinkSetDb' - | xmllint --format - |
  grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 |
  while read -r L; do
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${L}&rettype=fasta"
  done
>gi|821074095|gb|KKY28990.1| putative uracil phosphoribosyltransferase [Diplodia seriata]
MFVHASGPESIKFKHLQGQVQVLLVDSVINSGATILDFVEAIREINPGIRIVVVAGTVQAQCISPNNPFY
KTLAQHGDISLVALRSSETKFTGSGGTDTGNRLFNTTHLL

>gi|821074094|gb|KKY28989.1| putative integral membrane protein [Diplodia seriata]
MPQYFPWPYSVDPLPEDLRRGLWPVGIFALMSTVATLALLCWITYRLVSWRKHYRSYVGYNQYVLLIYNL
LLADLQQSISFLISFHWIHTDSMLAPSPACFGQAWLVQIGDISSGMFVLAIALHTFFSVVKGRQIPFRAF
LIGTIVIWALALLLTVLGPALHGSDYFTAAGAWCWASDKYETERLWLHYLWIFIIEFGTVIIYALIFIYL
RKQLVSIASAHQHSTQNKVSQAARYMVLYPLTYVLLTLPLAAGRMATMTGQTLPIAYYCAAGSMMTSCGW
VDAALYALTRRVLVSNEIDQPQGGAGKGASSSGGRTGYGGHGSSHTATGWDIASFSDRKGGMGADHSVTI
TGGLDARGSNFIDMDELSKGGVHHHATERVGRPKHKGSSTPSTQGLTRARSSSTSARESTPRGSTDSILA
GLGGVRAETKVEIRVEPANGFMLPGEGSGSNGSSGMSTPNGRTVEVVGNSHAMRPRSGSPY
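
The loop above issues one efetch request per protein ID, which can be slow for a large project. efetch also accepts a comma-separated list of IDs, so the requests can be batched. What follows is only a sketch under stated assumptions: `batch_ids` and `efetch_url` are hypothetical helper names, `ids.txt` is a hypothetical file with one ID per line, and the batch size of 200 and the one-second pause are conservative guesses, not tuned values.

```shell
# Hypothetical helper: read one ID per line on stdin and print the IDs as
# comma-separated batches, one batch per output line.
batch_ids() {
  # xargs groups up to N IDs per line (default 200); tr turns spaces into commas.
  xargs -n "${1:-200}" | tr ' ' ','
}

# Hypothetical helper: build the efetch URL for one comma-separated batch.
efetch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=%s&rettype=fasta&retmode=text\n' "$1"
}

# Usage (needs network access, so it is left commented out here):
#   batch_ids 200 < ids.txt | while read -r ids; do
#     if [ -n "$ids" ]; then curl -s "$(efetch_url "$ids")"; fi
#     sleep 1   # stay well under NCBI's request-rate limit
#   done > proteins.fasta
```

Redirecting the whole loop into `proteins.fasta` gives the single concatenated file asked for in the question.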

Sej Modha (Glasgow, UK) wrote, 2.5 years ago:

The same as @pierre's solution, using NCBI's Entrez Direct (EDirect) Unix command-line tools:

esearch -db bioproject -query 261773 | elink -target protein | efetch -format fasta
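
As written, the pipeline prints the sequences to the terminal. A small sketch of saving them to one file instead, assuming EDirect is installed and on the PATH; the function name `fetch_proteins` and the default file name `proteins.fasta` are hypothetical choices for illustration.

```shell
# Hypothetical wrapper: run the EDirect pipeline and save the result to one file.
fetch_proteins() {
  out="${1:-proteins.fasta}"   # output file name; the default is an arbitrary choice
  # Stop early with a clear message if any EDirect tool is missing.
  for tool in esearch elink efetch; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
  esearch -db bioproject -query 261773 | elink -target protein | efetch -format fasta > "$out"
}

# Usage (needs EDirect and network access):
#   fetch_proteins proteins.fasta
```

Checking for the tools first avoids leaving behind an empty output file when EDirect is not installed.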
tlorin (Switzerland) wrote, 2.5 years ago:

Here is a well-explained tutorial for your problem :)

modified 2.5 years ago • written 2.5 years ago by tlorin

The link you provided doesn't seem to work.

written 2.5 years ago by Sej Modha

This should be better now, thanks!

written 2.5 years ago by tlorin

Thank you all for your help! It works fine.

written 2.5 years ago by guillaume.rbt