How can I download all protein sequences for complete genomes of Acinetobacter baumannii from the NCBI FTP site?
See my answer here: A: How to extract Refseq of downloaded files from NCBI
OP needs to get the *protein.faa.gz files, since protein data is needed.
OP, take a look at the help for ncbi-genome-download. Give it the option --format protein-fasta to get what you want.
(Or download the genome or CDS data and transform it yourself.)
I am running this command
ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta
but this gives me an MD5SUMS file with file names like this. I need FASTA sequences.
260ac38772d1f9d98641f03bc5b07596 ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110 ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9 ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8 ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723 ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85 ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905 ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53 ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98 ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418 ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1 ./annotation_hashes.txt
The MD5SUMS file is always provided. Its entries correspond to the files you need, which should be present in a folder named GCF_000...
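The MD5SUMS file can also be used to check whatever was actually downloaded. A minimal sketch, assuming GNU coreutils md5sum (8.25 or newer, for --ignore-missing); the function name verify_assembly_dir and the example path are made up for illustration:

```shell
#!/usr/bin/env bash
# Verify the downloaded files in one assembly folder against its MD5SUMS file.
# Assumption: GNU md5sum >= 8.25. --ignore-missing skips entries in MD5SUMS
# whose files were not downloaded (e.g. everything except *_protein.faa.gz).
verify_assembly_dir() {
    local dir="$1"
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dir" && md5sum --check --ignore-missing --quiet MD5SUMS )
}

# Example (hypothetical path):
# verify_assembly_dir refseq/bacteria/GCF_000018445.1 && echo "checksums OK"
```

The function returns non-zero if any present file fails its checksum, or if none of the listed files is present at all.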
Your command is also wrong. complete,chromosome is not one argument to the --assembly-level option; you should specify one or the other. Similarly, bacteria is a positional argument and should come last in the command.
Make sure you read the documentation on the GitHub page.
Try:
ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria
or
ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria
I got:
$ ls /refseq/bacteria
GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz MD5SUMS
GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz MD5SUMS
GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz MD5SUMS
GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz MD5SUMS
GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz MD5SUMS
GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz MD5SUMS
...
...
You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards with something like find ./ -name "*.faa.gz"
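As a rough sketch of that collection step, the find call can feed the matches straight to gzip so that all proteins end up in one FASTA file. The function name collect_proteins and the output file name are made up for illustration, and the folder layout is assumed to match the listing above:

```shell
#!/usr/bin/env bash
# Concatenate every *_protein.faa.gz found under a download root into one
# uncompressed FASTA file. Empty assembly folders contribute nothing because
# find reports no match inside them.
collect_proteins() {
    local root="$1" out="$2"
    # -exec ... {} + passes the matched paths to gzip in batches;
    # gzip -dc decompresses to stdout and leaves the .gz files in place.
    find "$root" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$out"
}

# Example (hypothetical paths):
# collect_proteins refseq/bacteria all_proteins.faa
```

The records come out in whatever order find walks the folders, so sort the paths first if a stable order matters.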
Hi Joe, hope you are safe and well.
Why does this work:
ncbi-genome-download -n -s refseq bacteria --genera Zhihengliuella -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10
INFO: Using cached summary.
Considering the following 1 assemblies for download:
GCF_002848265.1 Zhihengliuella sp. ISTPL4 ISTPL4
And this script doesn't?
echo "Downloading genomes from NCBI"
input="bac_taxa.txt"
while IFS= read -r line
do
mkdir $line
cd $line
echo "Downloading $line genomes from NCBI"
ncbi-genome-download -n -s refseq bacteria --genera $line -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10
cd ..
done < "$input"
Downloading Acidiphilium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipila genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipropionibacterium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisarcina genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisoma genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisphaera genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithiobacillus genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithrix genomes from NCBI
Unsupported assembly level: cds-fasta
My list of genera (example):
Acidovorax
Acinetobacter
Acrocarpospora
Actibacterium
Actinoallomurus
Actinoalloteichus
Actinobacillus
Actinobacteria
actinobacterium
Actinobaculum
Actinocatenispora
Actinocorallia
Actinocrispum
Actinokineospora
Actinomadura
I used this to download all the genomic FASTA files and it worked just fine! Thanks