Question: How to download large protein data from NCBI?
enriqp02 (10) wrote 2.3 years ago:

Hello everybody!

I'm working with metaproteomics samples and I want to use different search engines to look for all the proteins. To do this, I need a good protein database. I tried to download one from the NCBI web page, but the dataset I want (all bacterial proteins) is too large to download via the web. There is also no pre-built set available for download via FTP. I found this Perl script on the internet, but the download would take months.

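#!/usr/bin/perl
# Based on www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Usage: perl ncbi_fetch.pl > output_file
# Requires LWP::Protocol::https, because the E-utilities are served over https.

use strict;
use warnings;
use LWP::Simple;
use URI::Escape;

my $utils  = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
my $db     = ask_user("Database", "nuccore|nucest|protein|pubmed");
my $query  = ask_user("Query", "Entrez query");
my $report = ask_user("Report", "fasta|genbank|abstract|acc");

# Run the search once and keep the result set on the NCBI history server.
my $esearch = "$utils/esearch.fcgi?db=$db&usehistory=y&term=" . uri_escape($query);
my $esearch_result = get($esearch);
die "Error: esearch request failed\n" unless defined $esearch_result;
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s
    or die "Error: could not parse esearch output\n";
my ($Count, $QueryKey, $WebEnv) = ($1, $2, $3);
print STDERR "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

my $retstart  = 0;
my $retmax    = 10000;  # NCBI documents a limit of 10,000 records per efetch request
my $max_tries = 5;      # give up on a batch after this many failed attempts
my $tries     = 0;
while ($retstart < $Count) {
    my $efetch = "$utils/efetch.fcgi?"
        . "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&"
        . "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
    print STDERR "Downloading database $retstart / $Count\n";

    my $efetch_result = get($efetch);
    $efetch_result = "" unless defined $efetch_result;  # get() returns undef on failure

    # Count the records received so that truncated batches can be detected.
    my $countSeqs;
    if ($report eq 'fasta') {
        $countSeqs = () = $efetch_result =~ /^>/mg;         # one header line per record
    }
    elsif ($report eq 'genbank') {
        $countSeqs = () = $efetch_result =~ m|^//\s*$|mg;   # "//" terminates each record
    }
    elsif ($report eq 'acc') {
        $countSeqs = ($efetch_result =~ tr/\n//);           # one accession per line
    }

    my $expected = $retmax;
    if ($retstart > $Count - $retmax) {
        $expected = $Count - $retstart;
    }

    # Formats without a record marker (e.g. abstract) are accepted if non-empty.
    my $ok = defined($countSeqs) ? ($countSeqs >= $expected) : length($efetch_result);
    if ($ok) {
        print $efetch_result;
        $retstart += $retmax;
        $tries = 0;
    }
    else {
        my $got = defined($countSeqs) ? $countSeqs : 0;
        die "Error: giving up at record $retstart\n" if ++$tries >= $max_tries;
        print STDERR "ERROR...TRYING AGAIN ($got / $expected)\n";
        sleep 5;  # pause before retrying instead of hammering the server
    }
}

sub ask_user {
    my ($name, $choices) = @_;
    print STDERR "$name [$choices]: ";
    my $rc = <STDIN>;
    die "Error: Empty field: $name\n" unless defined $rc;
    chomp $rc;
    die "Error: Empty field: $name\n" if $rc eq "";
    return $rc;
}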
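
As a rough illustration of the batching and retry check in the script above, here is a minimal Python sketch. The helper names (`batch_offsets`, `efetch_url`, `count_records`) are made up for this example, and the batch size of 10,000 assumes the per-request limit that NCBI documents for efetch. Nothing here contacts NCBI: it only builds URLs and counts records in a response that has already been downloaded.

```python
from urllib.parse import urlencode

# Base URL as it appears in the esearch request logged later in this thread.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def batch_offsets(count, retmax=10000):
    """Yield (retstart, expected) pairs that together cover `count` records."""
    for retstart in range(0, count, retmax):
        yield retstart, min(retmax, count - retstart)


def efetch_url(db, query_key, webenv, report, retstart, retmax):
    """Build an efetch URL for one batch of a history-server result set."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": webenv,
        "rettype": report,
        "retmode": "text",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def count_records(text, report):
    """Count records in one batch; return None for formats we cannot count."""
    lines = text.splitlines()
    if report == "fasta":
        return sum(1 for line in lines if line.startswith(">"))
    if report == "genbank":
        return sum(1 for line in lines if line.strip() == "//")
    if report == "acc":
        return sum(1 for line in lines if line.strip())
    return None
```

A download loop would compare `count_records(...)` with the `expected` value for each batch and re-request the batch, after a short pause, when fewer records came back.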

Do you know another way to do this?

Thanks in advance.

download proteins database ncbi • 1.1k views
modified 2.3 years ago by tdmurphy (190) • written 2.3 years ago by enriqp02 (10)

This Python script that I wrote downloads the FASTA sequence of all proteins matching a keyword, across all species. It is configurable, though. See if you can avail of it: A: How to download all sequences of a list of proteins for a particular organism

Edit: You are looking for the actual amino acid sequence, I presume?

modified 2.3 years ago • written 2.3 years ago by Kevin Blighe (66k)

Thank you! I used your script and it works perfectly. One question, though: if I search NCBI for human, I get many results that I'm not interested in.

In this case: Animals(1,419,740) Plants(4,494) Fungi(898,540) Protists(203,856) Bacteria(84,639,903) Archaea(6,043) Viruses(1,749,114)

I would like to use "Homo sapiens"[Organism] instead, but then your script doesn't work. Is there any solution for this?

Thanks again

written 2.3 years ago by enriqp02 (10)

I also noticed that only 20 sequences are downloaded for human :S.

modified 2.3 years ago • written 2.3 years ago by enriqp02 (10)

Any solution to this problem?

written 2.3 years ago by enriqp02 (10)

Yes, for human data, just replace this line:

LookupCommand = "refseq[FILTER] AND " + SearchTerm + "[TITL]"

...with this:

LookupCommand = "refseq[FILTER] AND txid9606[Organism] AND " + SearchTerm + "[TITL]"

txid9606 is the NCBI Taxonomy ID for Homo sapiens.

written 2.3 years ago by Kevin Blighe (66k)

Thanks for your reply.

I still have the same problem. It only downloads 20 sequences, as shown on the website.

written 2.3 years ago by enriqp02 (10)
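
One possible explanation for the limit of 20 (an assumption, since the Python script itself is not reproduced in this thread) is that the E-utilities esearch call returns only 20 IDs by default unless `retmax` is set. The sketch below builds an esearch URL with an explicit `retmax`. `build_term` and `esearch_url` are hypothetical helper names, not part of the script discussed above, and the value 10000 is just an example.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_term(search_term, taxid=None):
    """Compose the Entrez query from this thread, optionally restricted to a taxon."""
    parts = ["refseq[FILTER]"]
    if taxid is not None:
        parts.append(f"txid{taxid}[Organism]")
    parts.append(f"{search_term}[TITL]")
    return " AND ".join(parts)


def esearch_url(term, db="protein", retmax=10000):
    """Build an esearch URL; if retmax is omitted the service returns 20 IDs."""
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"
```

For example, `esearch_url(build_term("kinase", 9606))` asks for up to 10,000 human RefSeq proteins with "kinase" in the title; "kinase" is only a placeholder search term.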

genomax (91k), United States, wrote 2.3 years ago:

You can use NCBI Entrez Direct (EDirect), the command-line interface to eUtils, which you will need to download and install.

 esearch -db protein -query "2[taxid] AND refseq [filter]" | efetch -format acc

(Change acc to fasta if you want the actual sequences.) Here 2 is the taxID for bacteria (refine as needed), and the refseq filter restricts the results to curated proteins. If you replace 2 with 9606 you will get the same information for humans.

As for the time, there is not much you can do. It will depend on your internet speed (NCBI has plenty of bandwidth on their end). You could retrieve the proteins from BLAST indexes (nr) if you have those available. But that is still a large download, so if you don't have those indexes it may take as long or longer.
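
If you want to drive this pipeline from a script, for instance to switch between taxIDs and output formats, a minimal Python sketch could build the command string as follows. `edirect_command` is a made-up helper name, and the sketch assumes `esearch` and `efetch` from EDirect are on your PATH when the command is eventually run.

```python
import shlex


def edirect_command(taxid, fmt="acc", db="protein"):
    """Return the esearch | efetch pipeline from this answer as a shell string."""
    query = f"{taxid}[taxid] AND refseq [filter]"
    esearch = ["esearch", "-db", db, "-query", query]
    efetch = ["efetch", "-format", fmt]
    # shlex.join quotes the query so the shell passes it as a single argument.
    return " | ".join(shlex.join(cmd) for cmd in (esearch, efetch))
```

The resulting string can be passed to `subprocess.run(cmd, shell=True, stdout=outfile)` to write the result straight to a file rather than holding it in memory.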

modified 2.3 years ago • written 2.3 years ago by genomax (91k)

ProteomicaVI@DESKTOP-GTTSG80:~/Base_de_datos_Lola$  esearch -db Protein -query "9606[taxid] AND refseq [filter]" | efetch -format fasta
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain'
Result of do_post http request is
$VAR1 = bless( {
                 '_rc' => 501,
                 '_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
                 '_msg' => 'Protocol scheme \'https\' is not supported (LWP::Protocol::https not installed)',
                 '_headers' => bless( {
                                        'content-type' => 'text/plain',
                                        'client-warning' => 'Internal response',
                                        '::std_case' => {
                                                          'client-warning' => 'Client-Warning',
                                                          'client-date' => 'Client-Date'
                                                        },
                                        'client-date' => 'Thu, 28 Jun 2018 12:07:46 GMT'
                                      }, 'HTTP::Headers' ),
                 '_request' => bless( {
                                        '_uri' => bless( do{\(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' ),
                                        '_content' => 'db=protein&term=9606%5Btaxid%5D%20AND%20refseq%20%5Bfilter%5D&retmax=0&usehistory=y&edirect_os=linux&edirect=9.20&tool=edirect&email=ProteomicaVI@DESKTOP-GTTSG80.localdomain',
                                        '_method' => 'POST',
                                        '_headers' => bless( {
                                                               'content-type' => 'application/x-www-form-urlencoded',
                                                               'user-agent' => 'libwww-perl/6.34'
                                                             }, 'HTTP::Headers' )
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

WebEnv value not found in search output - WebEnv1
Db value not found in fetch input

I followed your steps but I got this error.

modified 2.3 years ago • written 2.3 years ago by enriqp02 (10)

You need to install LWP::Protocol::https. If you have permissions to install things on this machine then follow the directions here. Use the appropriate one for the OS you are using.

written 2.3 years ago by genomax (91k)
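
To confirm that the module is visible to Perl after installing it, a quick check can be scripted. This is only a sketch: `perl_module_available` is a hypothetical helper that runs `perl -M<module> -e 1` and reports whether Perl could load the module.

```python
import shutil
import subprocess


def perl_module_available(module):
    """Return True if `perl -M<module> -e 1` succeeds, False otherwise."""
    perl = shutil.which("perl")
    if perl is None:
        return False  # no Perl interpreter on PATH
    result = subprocess.run(
        [perl, f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `perl_module_available("LWP::Protocol::https")` should return True once the installation has worked for the same Perl that EDirect uses.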

I am not sure if this tool would exclude the protein data but maybe you can try https://github.com/kblin/ncbi-genome-download

modified 2.3 years ago • written 2.3 years ago by Sej Modha (4.7k)

tdmurphy (190) wrote 2.3 years ago:

NCBI RefSeq includes nearly all bacterial proteins, and has files available for download at: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
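
The directory holds many files, so here is a minimal Python sketch for picking out the protein FASTA files from its listing. It assumes that the protein files end in `.protein.faa.gz` and that the listing is an HTML page with ordinary `href` links. `protein_fasta_links` is a made-up helper, and the function only parses a listing you have already fetched.

```python
import re

RELEASE_URL = "https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/"


def protein_fasta_links(listing_html, base_url=RELEASE_URL):
    """Return full URLs of all *.protein.faa.gz files named in a directory listing."""
    names = re.findall(r'href="([^"/]+\.protein\.faa\.gz)"', listing_html)
    # dict.fromkeys keeps the original order while dropping duplicate links.
    return [base_url + name for name in dict.fromkeys(names)]
```

The listing itself could be obtained with `urllib.request.urlopen(RELEASE_URL).read().decode()`. Gzip files can be concatenated as they are, so the downloaded pieces can be appended into one combined database file.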

written 2.3 years ago by tdmurphy (190)

Thanks a lot.

Do you know which file it is? There are many protein.faa files.

written 2.3 years ago by enriqp02 (10)