Question: How to download large protein data from NCBI?
13 months ago, enriqp02 wrote:

Hello everybody!

I'm working with metaproteomics samples and I want to use different search engines to identify all the proteins. To do this, I need a good protein database. I tried to download one from the NCBI website, but the dataset I want (all bacterial proteins) is too large to download via the web, and there is no pre-built set available for download via FTP. I found this Perl script on the internet, but the download would take months.

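#!/usr/bin/perl
# Based on www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Usage: perl ncbi_fetch.pl > output_file

use strict;
use warnings;
use LWP::Simple;    # https URLs also need LWP::Protocol::https installed
use URI::Escape;

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query  = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");

# Run the search once and keep the result set on the NCBI history server.
my $esearch = "$utils/esearch.fcgi?db=$db&usehistory=y&term=";
my $esearch_result = get($esearch . uri_escape($query));
die "Error: esearch request failed\n" unless defined $esearch_result;
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Error: could not parse esearch output\n";
my ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
print STDERR "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

# Fetch the stored result set in batches of $retmax records.
my $retstart  = 0;
my $retmax    = 10000;      # NCBI documents a maximum of 10,000 records per efetch request
my $tolerance = $retmax;    # how many records a batch may fall short before it is retried
while ($retstart < $Count) {
    my $efetch = "$utils/efetch.fcgi?"
        . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print STDERR "Downloading database $retstart / $Count\n";

    my $efetch_result = get($efetch);
    if (!defined $efetch_result) {
        # get() returns undef on any HTTP failure; wait and retry the same batch.
        print STDERR "ERROR...TRYING AGAIN (request failed)\n";
        sleep 5;
        next;
    }

    # Count the records in this batch (stays 0 for formats without a counter).
    my $countSeqs = 0;
    if ($report eq 'fasta') {
        $countSeqs = () = $efetch_result =~ /^>/mg;      # header lines
    }
    elsif ($report eq 'genbank') {
        $countSeqs = () = $efetch_result =~ m|^//|mg;    # record terminators
    }
    elsif ($report eq 'acc') {
        $countSeqs = ($efetch_result =~ tr/\n//);        # one accession per line
    }

    # The last batch is usually smaller than $retmax.
    my $expected = $retmax;
    if ($retstart > $Count - $retmax) {
        $expected = $Count - $retstart;
    }

    # The tolerance equals the batch size, so any batch that was returned is
    # accepted; lower $tolerance to enforce a stricter check.
    if ($countSeqs >= $expected - $tolerance) {
        print $efetch_result;
        $retstart += $retmax;
    }
    else {
        print STDERR "ERROR...TRYING AGAIN ($countSeqs / $expected)\n";
    }
}

# Prompt on STDERR (so the prompt does not end up in output_file) and read one line.
sub ask_user {
    my ($field, $choices) = @_;
    print STDERR "$field [$choices]: ";
    my $rc = <STDIN>;
    die "Error: Empty field: $field\n" unless defined $rc;
    chomp $rc;
    die "Error: Empty field: $field\n" if $rc eq "";
    return $rc;
}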

Do you know another way to do this?

Thanks in advance.

download proteins database ncbi • 675 views
modified 13 months ago by tdmurphy • written 13 months ago by enriqp02

This Python script that I wrote downloads the FASTA sequence of all proteins matching a keyword, across all species. It is configurable, though. See if you can avail of it: A: How to download all sequences of a list of proteins for a particular organism

Edit: You are looking for the actual amino acid sequence, I presume?

modified 13 months ago • written 13 months ago by Kevin Blighe

Thank you! I used your script and it works perfectly. One question, though: if I search NCBI for "human", I get many results that I'm not interested in.

In this case: Animals(1,419,740) Plants(4,494) Fungi(898,540) Protists(203,856) Bacteria(84,639,903) Archaea(6,043) Viruses(1,749,114)

In my case, I would like to use: "Homo sapiens"[Organism] but in this case, your script doesn't work. Is there any solution for this?

Thanks again

written 13 months ago by enriqp02

I also noticed that only 20 sequences are downloaded for human :S.

modified 13 months ago • written 13 months ago by enriqp02

Any solution to this problem?

written 13 months ago by enriqp02

Yes, for human data, just replace this line:

LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

...with this:

LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"

txid9606 is a reference to Homo sapiens

written 13 months ago by Kevin Blighe
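
The organism restriction above can also be made a parameter instead of a hard-coded edit. This is only a minimal sketch, assuming the script builds its query by string concatenation as in the two lines quoted above; `build_lookup_command` and its `taxid` argument are hypothetical names, not part of the original script.

```python
def build_lookup_command(search_term, taxid=None):
    """Return an Entrez query for RefSeq proteins whose title matches search_term.

    taxid is an optional NCBI Taxonomy ID (9606 = Homo sapiens, 2 = Bacteria).
    When it is None, the query is not restricted to any organism.
    """
    parts = ["refseq[FILTER]"]
    if taxid is not None:
        parts.append(f"txid{taxid}[Organism]")
    parts.append(f"{search_term}[TITL]")
    return " AND ".join(parts)


if __name__ == "__main__":
    # Prints: refseq[FILTER] AND txid9606[Organism] AND kinase[TITL]
    print(build_lookup_command("kinase", taxid=9606))
```

Calling it without `taxid` reproduces the original all-species query.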

Thanks for your reply.

I still have the same problem: only 20 sequences are downloaded, as they appear on the website.

written 13 months ago by enriqp02
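
One possible cause, assuming the script calls ESearch without setting `retmax`: ESearch returns only the first 20 IDs unless `retmax` is given, which would match the 20 results reported here. The sketch below shows how the parameter enters the request and how the total count differs from the number of IDs on one page. The helper names are hypothetical and the sample reply is made up for illustration; it is not real NCBI output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="protein", retmax=20, retstart=0):
    """Build an ESearch URL; retmax=20 mirrors the server-side default."""
    params = {"db": db, "term": term, "retmax": retmax, "retstart": retstart}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Return (total_count, ids_on_this_page) from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


if __name__ == "__main__":
    # Made-up reply: 3 matches in total, but only 2 IDs returned on this page.
    sample = (
        "<eSearchResult><Count>3</Count><RetMax>2</RetMax><RetStart>0</RetStart>"
        "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
    )
    total, ids = parse_esearch(sample)
    print(f"{len(ids)} of {total} IDs on this page")
    print(esearch_url("refseq[FILTER] AND txid9606[Organism]", retmax=500))
```

If the page holds fewer IDs than the total, either raise `retmax` or step `retstart` forward until all IDs have been collected.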
13 months ago, genomax (United States) wrote:

You can use NCBI eUtils (the Entrez Direct command-line tools), which you will need to download and install.

 esearch -db protein -query "2[taxid] AND refseq [filter]" | efetch -format acc

(Change acc to fasta if you want to get the actual sequences.) We are using 2 as the taxID, which is for bacteria (refine as needed), along with refseq as a filter to get curated proteins only. If you replace 2 with 9606 you will get the same information for humans.

As for time, there is not much you can do. It is going to depend on your internet speed (NCBI has plenty of bandwidth on their end). You could retrieve the proteins from BLAST indexes (nr) if you have those available, but that is still a large download, so if you don't have those indexes it may take the same time or longer.

modified 13 months ago • written 13 months ago by genomax
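
For readers who cannot install the command-line tools, the esearch-then-efetch pipeline above can be approximated with plain E-utilities URLs. This is a rough sketch, not what the tools do internally: it assumes an earlier ESearch call made with `usehistory=y` has already supplied a record count, a query key and a WebEnv string, and the batch size of 10,000 is an assumption based on NCBI's documented per-request limit. The function names are hypothetical.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_batches(count, query_key, webenv, db="protein", rettype="fasta", batch=10000):
    """Yield one EFetch URL per batch of records stored on the history server."""
    for retstart in range(0, count, batch):
        params = {
            "db": db,
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": rettype,
            "retmode": "text",
            "retstart": retstart,
            "retmax": batch,
        }
        yield f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def count_fasta_records(text):
    """Count FASTA records by counting the header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


if __name__ == "__main__":
    # Placeholder values; real ones come from the ESearch reply.
    for url in efetch_batches(25000, "1", "WEBENV_PLACEHOLDER"):
        print(url)
```

Each URL can then be fetched with `urllib.request.urlopen`, and `count_fasta_records` gives a quick check that a batch is not truncated before it is appended to the output file.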
ProteomicaVI@DESKTOP-GTTSG80:~/Base_de_datos_Lola$  esearch -db Protein -query "9606[taxid] AND refseq [filter]" | efetch -format fasta
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain'
Result of do_post http request is
$VAR1 = bless( {
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'client-date' => 'Thu, 28 Jun 2018 12:07:46 GMT'
                                      }, 'HTTP::Headers' ),
                 '_request' => bless( {
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' ),
                                        '_content' => 'db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain',
                                        '_method' => 'POST',
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.34'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

WebEnv value not found in search output - WebEnv1
Db value not found in fetch input

I followed your steps but I got this error.

modified 13 months ago • written 13 months ago by enriqp02

You need to install LWP::Protocol::https. If you have permission to install things on this machine, follow the directions here. Use the one appropriate for the OS you are using.

written 13 months ago by genomax

I am not sure if this tool would exclude the protein data but maybe you can try https://github.com/kblin/ncbi-genome-download

modified 13 months ago • written 13 months ago by Sej Modha
13 months ago, tdmurphy wrote:

NCBI RefSeq includes nearly all bacteria proteins, and has files available for download at: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

written 13 months ago by tdmurphy

Thanks a lot.

Do you know which file it is? There are many protein.faa files.

written 13 months ago by enriqp02
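
On the question of which file: release directories like this one typically split each data type into numbered chunks, so the complete protein set is probably the union of all the protein FASTA files rather than any single one. The sketch below picks those files out of the directory listing. The `.protein.faa.gz` suffix is an assumption about the naming pattern and should be checked against the actual listing; `LinkCollector` and `protein_fasta_urls` are hypothetical helper names.

```python
from html.parser import HTMLParser

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/"


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML directory listing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def protein_fasta_urls(listing_html, base_url=BASE_URL, suffix=".protein.faa.gz"):
    """Return full URLs for every listed file whose name ends with suffix."""
    parser = LinkCollector()
    parser.feed(listing_html)
    return [base_url + href for href in parser.links if href.endswith(suffix)]


if __name__ == "__main__":
    from urllib.request import urlopen

    # Fetch the live listing and print one URL per protein FASTA chunk.
    with urlopen(BASE_URL) as response:
        listing = response.read().decode("utf-8", errors="replace")
    for url in protein_fasta_urls(listing):
        print(url)
```

The printed URLs can be handed to any downloader, and because gzip files can be concatenated directly, the chunks can be joined into a single database file without decompressing them first.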