Question

How to best get ALL Bacterial proteins from NCBI

5.1 years ago

A Soggy Waffle • 0

Hey all,

I already have a head start on this question (following this tutorial). However, that method is taking a _really_ long time, since I have a list of ~0.5 billion sequences to get. Additionally, some of my threads are throwing errors during sequence filtering, and I'm afraid this method might not work.

So I'm asking whether you have a better idea for getting every bacterial protein sequence from NCBI. I don't think EDirect will work (I'll be blocked). One idea I had was to run esearch and efetch against a local copy of the complete protein set (nr.fa). However, EDirect doesn't support local queries out of the box (at least to my knowledge).

Any advice on how to wrangle EDirect into doing local queries, or any other ideas, would be much appreciated.

protein big data • 1.7k views

ADD COMMENT • link updated 5.1 years ago by Carambakaracho ★ 3.2k • written 5.1 years ago by A Soggy Waffle • 0

You can also download the .faa.gz files for every bacterium in RefSeq; see another tutorial.

ADD REPLY • link 5.1 years ago by shenwei356 8.4k

"how to get every bacterial protein sequence from NCBI"

That requirement, if absolute, will not be satisfied by these two things.

ADD REPLY • link 5.1 years ago by GenoMax 141k

Yes, I know. Without knowing what the data will be used for, I guessed that the proteins of bacteria in RefSeq would be enough for his/her purpose.

Anyway, one can try

# download
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# reformat: drop the first comment line, strip the leading "#" from the header,
# and replace double quotes, which can confuse the TSV parser
cat assembly_summary.txt | sed 1d | sed '1s/^# *//' \
    | sed 's/"/$/g' > assembly_summary.tsv

# where to download
dir=download
mkdir -p "$dir"

# fetch the protein FASTA of every assembly, 10 jobs at a time
cat assembly_summary.tsv \
    | csvtk cut -t -f ftp_path | sed 1d \
    | rush -v prefix='{}/{%}' -v dir="$dir" \
        ' \
            wget -c {prefix}_protein.faa.gz -O {dir}/{%}_protein.faa.gz \
        ' \
        -j 10 -c -C download.rush
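
If csvtk and rush are not installed, the URL-building step above can be approximated with awk alone. This is only a minimal sketch, not the method from this reply: the function name protein_urls is made up for illustration, and it assumes that the header row of assembly_summary.txt starts with "#" and contains a column named ftp_path (the same column the csvtk command selects).

```shell
# Hypothetical helper: read assembly_summary.txt on stdin and print one
# *_protein.faa.gz URL per assembly that has a download path.
protein_urls() {
    awk -F '\t' '
        # Comment/header rows start with "#"; remember where ftp_path is.
        /^#/ {
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        # Data rows: skip assemblies whose path is empty or "na".
        col && $col != "" && $col != "na" {
            n = split($col, parts, "/")
            print $col "/" parts[n] "_protein.faa.gz"
        }
    '
}

# Example use (not run here; needs network access):
#   protein_urls < assembly_summary.txt > urls.txt
#   wget -c -P download -i urls.txt
```

The download itself then runs serially, so it will be slower than the 10 parallel jobs used with rush.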

ADD REPLY • link 5.1 years ago by shenwei356 8.4k

"all protein" sequences is a moving target, anyway...

ADD REPLY • link 5.1 years ago by Carambakaracho ★ 3.2k

score 2 · Answer 1 · 2019-02-28

5.1 years ago

GenoMax 141k

You could download the nr BLAST indexes and then use blastdbcmd from the BLAST+ package (v. 2.8.1) to do something like this:

 blastdbcmd -db /path_to/nr_v5 -taxids 2 -outfmt %f -out file.fa

This may not be completely foolproof but should mostly work.

Note: You will need to get the new v5 BLAST indexes for this to work.
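
One possible reason this is not foolproof: as far as I know, BLAST+ releases from around 2.8.1 matched only the exact taxids given, so a high-level ID such as 2 (Bacteria) would first have to be expanded into species-level taxids, for example with the get_species_taxids.sh script that ships with BLAST+ (it needs EDirect). The sketch below assumes that workflow and passes the expanded list via -taxidlist. The names extract_by_taxids and bacteria.txids are hypothetical, and the behavior should be checked against your own BLAST+ version.

```shell
# Hypothetical wrapper: dump every sequence in a v5 BLAST database whose
# taxid appears in a list file (one taxid per line).
#   $1 = database path, $2 = taxid list file, $3 = output FASTA
extract_by_taxids() {
    # Refuse to run with a missing or empty list, which would give no output.
    [ -s "$2" ] || { echo "empty or missing taxid list: $2" >&2; return 1; }
    blastdbcmd -db "$1" -taxidlist "$2" -outfmt %f -out "$3"
}

# Example use (not run here; needs BLAST+, EDirect and the nr_v5 database):
#   get_species_taxids.sh -t 2 > bacteria.txids
#   extract_by_taxids /path_to/nr_v5 bacteria.txids file.fa
```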

ADD COMMENT • link 5.1 years ago by GenoMax 141k

I may try this. I am looking for the most sequences possible right now, not just RefSeq.

ADD REPLY • link 5.1 years ago by A Soggy Waffle • 0

Just occurred to me to ask: What would be the difference between the blast index filtered for bacteria and all of the RefSeq bacterial protein faa files?

ADD REPLY • link 5.1 years ago by A Soggy Waffle • 0

The BLAST index will have data for all bacteria, whereas RefSeq will likely be restricted to well-characterized, manually curated datasets.

ADD REPLY • link 5.1 years ago by GenoMax 141k

score 2 · Answer 2 · 2019-02-28

5.1 years ago

Carambakaracho ★ 3.2k

From blast/db/README

Contents of the /blast/db/FASTA directory

[...]

nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

From README.genbank

Protein sequences

The protein sequences present in GenBank releases, via coding regions annotated on GenBank records, are made available via files located elsewhere at the NCBI FTP site:

FTP Site: ftp.ncbi.nih.gov

Directory: ncbi-asn1/protein_fasta

URL: ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta

These files replace the single, comprehensive protein FASTA which used to be provided in this directory ( relNNN.fsa_aa.gz ).

Please see the README in the /protein_fasta directory for further information.

This is what it points to: ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/ and its README

Is this what you're looking for?

ADD COMMENT • link 5.1 years ago by Carambakaracho ★ 3.2k

The gbbct* files in this directory would work, but there is going to be a lot of redundancy. It may still be worth using the nr database to avoid this issue, but that is something the original poster will have to decide.
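
To give an idea of what removing that redundancy could look like, here is a minimal sketch that keeps only the first record for each exact sequence string. The function name dedup_fasta is hypothetical, and the approach holds every distinct sequence in memory, so it is an illustration for small subsets and would likely not scale to the full set of files.

```shell
# Hypothetical helper: read FASTA on stdin, join wrapped sequence lines,
# and print only the first record seen for each identical sequence.
dedup_fasta() {
    awk '
        # Print the buffered record unless its sequence was already seen.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    '
}

# Example use (not run here):
#   gzip -dc gbbct*.fsa_aa.gz | dedup_fasta > bacteria_unique.fa
```

Note that this drops the headers of the duplicate records, whereas nr merges them into one header line.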

ADD REPLY • link 5.1 years ago by GenoMax 141k

This may be a good backup to using the nr_v5 database.

ADD REPLY • link 5.1 years ago by A Soggy Waffle • 0

I couldn't believe it was no longer there, and it turns out it still is:

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

specifically the nr.gz file (the link points to a 45 GB file). It still requires filtering for the bacterial entries, though...
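
A rough sketch of such a filter, assuming you already have a list of bacterial protein accessions, one accession.version per line. How that list is built is not covered here; NCBI's prot.accession2taxid table combined with the taxonomy dump is one possible source. The names filter_fasta_by_accession and bacterial_accessions.txt are hypothetical. The sketch also assumes that merged nr headers separate their entries with a Ctrl-A character, and a record is kept if any of its entries is in the list.

```shell
# Hypothetical helper: keep FASTA records (read from stdin) whose header
# contains at least one accession listed in the file given as $1.
filter_fasta_by_accession() {
    awk '
        # First input: the accession list, loaded into a lookup table.
        FILENAME == ARGV[1] { keep[$1] = 1; next }
        # Header line: test the first word of every Ctrl-A separated entry.
        /^>/ {
            printing = 0
            n = split(substr($0, 2), entries, "\001")
            for (i = 1; i <= n; i++) {
                split(entries[i], words, " ")
                if (words[1] in keep) { printing = 1; break }
            }
        }
        # Print header and sequence lines of records that matched.
        printing
    ' "$1" -
}

# Example use (not run here; nr.gz is a very large download):
#   gzip -dc nr.gz | filter_fasta_by_accession bacterial_accessions.txt > bacteria.fa
```

The whole accession list is held in memory, so for hundreds of millions of accessions the blastdbcmd route is probably more practical.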

ADD REPLY • link 5.1 years ago by Carambakaracho ★ 3.2k