How to best get ALL Bacterial proteins from NCBI
2
0
2.8 years ago

Hey all,

I already have a head start on this question (following this tutorial). However, that method is taking a _really_ long time, since I have a list of ~0.5 billion sequences to get. Additionally, some of my threads are throwing errors during sequence filtering, and I'm afraid this method might not work.

So! I'm asking whether you have a better idea for getting every bacterial protein sequence from NCBI. I don't think EDirect will work (I'll be blocked). One idea I had was to run esearch and efetch against a local copy of the full protein set (nr.fa). However, EDirect doesn't support local queries out of the box (at least to my knowledge).

Any advice on how to wrangle EDirect into doing local queries, or any other ideas, would be much appreciated.

protein big data • 777 views
0

You can also download the .faa.gz files for every bacterium in RefSeq; check another tutorial.

0

how to get every bacterial protein sequence from NCBI

That requirement, if absolute, will not be satisfied by these two things.

0

Yes, I know. I guess the proteins of bacteria in RefSeq are enough for his/her purpose, though we don't yet know what he/she will use the data for.

Anyway, one can try

# download
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# reformat: drop the first comment line, strip the leading "#" from the header
# line, and replace double quotes so they do not confuse CSV parsing
cat assembly_summary.txt | sed 1d | sed '1s/^# *//' \
    | sed 's/"/$/g' > assembly_summary.tsv

# where to download
dir=download
mkdir -p "$dir"

# fetch the protein FASTA for every assembly
cat assembly_summary.tsv \
    | csvtk cut -t -f ftp_path | sed 1d \
    | rush -v prefix='{}/{%}' -v dir="$dir" \
        ' \
            wget -c {prefix}_protein.faa.gz -O {dir}/{%}_protein.faa.gz \
        '
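
If csvtk and rush are not installed, the URL-building step can probably be done with plain awk. This is only a sketch: `protein_urls` is a made-up helper name, and it assumes the summary file is tab-separated with a `#`-prefixed header line containing an `ftp_path` column, as in the file downloaded above. Rows whose path is empty or `na` are skipped.

```shell
# Hypothetical helper: read assembly_summary.txt on stdin and print one
# *_protein.faa.gz URL per assembly.
protein_urls() {
    awk -F '\t' '
        # Comment/header lines: strip the leading "#" and look for ftp_path.
        /^#/ {
            sub(/^#[ ]*/, "")
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        # Data lines: only emit a URL when the row has a usable path.
        col && $col != "" && $col != "na" {
            n = split($col, parts, "/")
            printf "%s/%s_protein.faa.gz\n", $col, parts[n]
        }
    '
}
```

Something like `protein_urls < assembly_summary.txt > urls.txt` followed by `wget -c -i urls.txt -P download` should then fetch the files one at a time, without the parallelism that rush provides.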


0

"all protein" sequences is a moving target, anyway...

2
2.8 years ago
GenoMax 109k

You could download the nr BLAST indexes and then use blastdbcmd from the BLAST+ (v. 2.8.1) package to do something like this:

 blastdbcmd -db /path_to/nr_v5 -taxids 2 -outfmt %f -out file.fa


This may not be completely foolproof but should mostly work.

Note: You will need to get the new v5 BLAST indexes for this to work.
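
One possible reason this is not foolproof: as far as I understand, the v5 databases attach taxids at species level or below, so a high-level taxid such as 2 may need to be expanded first. BLAST+ ships a `get_species_taxids.sh` script for that (it needs EDirect and network access), and blastdbcmd accepts the result via `-taxidlist`. The sketch below only prints the two commands so they can be reviewed first. `taxid_dump_plan` and the `taxid_2.txids` file name are made up for illustration, and whether the expansion is needed on your BLAST+ version is an assumption worth checking.

```shell
# Hypothetical helper: print (not run) the two commands for dumping every
# protein under a high-level taxid from a v5 BLAST database.
# Assumes the arguments contain no spaces or shell metacharacters.
taxid_dump_plan() {
    db=$1 taxid=$2 out=$3
    ids="taxid_${taxid}.txids"
    # Step 1: expand the high-level taxid into species-level taxids.
    echo "get_species_taxids.sh -t ${taxid} > ${ids}"
    # Step 2: extract all sequences assigned to any of those taxids.
    echo "blastdbcmd -db ${db} -taxidlist ${ids} -outfmt %f -out ${out}"
}

# Print the plan for Bacteria (taxid 2); pipe it to sh to actually run it.
taxid_dump_plan /path_to/nr_v5 2 file.fa
```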

0

I may try this. I am looking for the most sequences possible right now, not just RefSeq.

0

Just occurred to me to ask: What would be the difference between the blast index filtered for bacteria and all of the RefSeq bacterial protein faa files?

1

The BLAST index will have data for all bacteria, whereas RefSeq will likely be restricted to well-characterized, manually curated datasets.

2
2.8 years ago
Carambakaracho ★ 2.9k

1. Contents of the /blast/db/FASTA directory

[...]

nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

Protein sequences

The protein sequences present in GenBank releases, via coding regions annotated on GenBank records, are made available via files located elsewhere at the NCBI FTP site:

These files replace the single, comprehensive protein FASTA which used to be provided in this directory ( relNNN.fsa_aa.gz ).

This is what it points to: ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/ and its README

Is this what you're looking for?

0

The gbbct* files in this directory would work, but there is going to be a lot of redundancy. It may still be worth using the nr database to avoid this issue, but that is something the original poster will have to decide.
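
If the redundancy is the main concern, exact duplicates can in principle be collapsed after download. A minimal sketch, assuming plain FASTA on stdin: `dedup_fasta_by_seq` is a hypothetical helper that keeps the first record for each distinct sequence and writes sequences unwrapped on one line. It holds every distinct sequence in memory, so it is only meant to illustrate the idea on a subset. At the scale discussed here a dedicated tool (for example `seqkit rmdup -s`) is likely the better choice.

```shell
# Hypothetical helper: keep only the first FASTA record for each distinct
# sequence. Comparison is exact and case-sensitive.
dedup_fasta_by_seq() {
    awk '
        # Print the buffered record if its sequence has not been seen yet.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }   # join wrapped sequence lines
        END { emit() }
    '
}
```

Usage would be along the lines of `gzip -dc gbbct*.gz | dedup_fasta_by_seq > bacteria_unique.fa`, where the glob is a guess at the file naming in that directory.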

0

This may be a good backup to using the nr_v5 database.

0

I didn't believe it wasn't there anymore:

specifically the nr.gz file (links to 45GB file). Still requires a filter on the bacterial entries, though...
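
For that filtering step, one option is to prepare a list of bacterial accessions separately (for instance from NCBI's prot.accession2taxid mapping) and stream the FASTA through a small filter. The sketch below assumes such a list already exists, one accession per line. `filter_fasta_by_accession` is a hypothetical helper, and it only looks at the first accession in each header, so nr entries whose bacterial accession appears later in a merged header would be missed.

```shell
# Hypothetical helper: keep FASTA records (read from stdin) whose first
# header token, minus the ">", is listed in the accession file given as $1.
filter_fasta_by_accession() {
    awk '
        # First input (the accession list): remember every accession.
        FILENAME == ARGV[1] { keep[$1] = 1; next }
        # FASTA header: decide whether this record should be printed.
        /^>/ { acc = substr($1, 2); printing = (acc in keep) }
        # Print header and sequence lines of wanted records.
        printing
    ' "$1" -
}
```

Usage might look like `gzip -dc nr.gz | filter_fasta_by_accession bacterial_accessions.txt > bacteria.fa`, with `bacterial_accessions.txt` being a placeholder name.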