Question: How to download all protein sequences of a bacterium using the NCBI FTP site?
sharmatina189059 wrote (2.4 years ago):

How can I download the protein sequences for all complete genomes of Acinetobacter baumannii from the NCBI FTP site?

Joe wrote (2.4 years ago):

See my answer here: A: How to extract Refseq of downloaded files from NCBI


OP needs to get *protein.faa.gz files since protein data is needed.

written 2.4 years ago by GenoMax

OP, take a look at the help for ncbi-genome-download. Pass the option --format protein-fasta to get what you want.

(or download the genome or CDS data and transform it yourself)

written 2.4 years ago by Joe

I am running this command:

ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta

but it gives me MD5SUMS files listing file names like this. I need the FASTA sequences.

260ac38772d1f9d98641f03bc5b07596  ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110  ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9  ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8  ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723  ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85  ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f  ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905  ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53  ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98  ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac  ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418  ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1  ./annotation_hashes.txt
written 2.4 years ago by sharmatina189059

The sequences should be included in the *.faa files.

written 2.4 years ago by Sishuo Wang

The MD5sums are always provided. They correspond to the files you need which should be present in a folder named GCF_000....
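
To check a downloaded protein file against its MD5SUMS entry, something like the following sketch may help. It assumes GNU md5sum is available and that MD5SUMS uses the layout shown in the listing above (hash, two spaces, ./filename). verify_protein_md5 is a hypothetical helper name, not part of ncbi-genome-download.

```shell
# Verify the protein FASTA in one assembly folder against its MD5SUMS file.
# The body runs in a subshell, so the caller's working directory is unchanged.
verify_protein_md5() (
    cd "$1" || exit 1
    # Keep only the protein FASTA entry; the other listed files were not downloaded.
    # The pattern does not match *_translated_cds.faa.gz.
    grep '_protein\.faa\.gz$' MD5SUMS | md5sum -c -
)
```

Called as verify_protein_md5 refseq/bacteria/GCF_000018445.1, it should return a non-zero status if the checksum does not match or no protein entry is found.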

Your command is also wrong. complete,chromosome is not one argument to the --assembly-level option; you should specify one or the other. Also, bacteria is a positional argument and should come last in the command.

Make sure you read the documentation on the GitHub page.

Try:

ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

or

 ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

I got:

$  ls /refseq/bacteria

GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz  MD5SUMS

GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz  MD5SUMS

GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz  MD5SUMS

GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz  MD5SUMS

GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz  MD5SUMS

GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz  MD5SUMS
...
...

You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards, with something like find ./ -name "*.faa.gz"
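
Building on that find command, a minimal sketch for merging every downloaded protein file into one uncompressed FASTA could look like this. collect_proteins and the output file name are hypothetical, and the sketch assumes the files follow the *_protein.faa.gz naming shown in the listing above.

```shell
# Concatenate every downloaded protein FASTA under a directory into one file.
# $1 = download directory, $2 = output FASTA file.
collect_proteins() {
    # gzip -dc writes the decompressed content to stdout;
    # folders without a matching file contribute nothing.
    find "$1" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$2"
}
```

For example, collect_proteins refseq/bacteria all_proteins.faa would write all sequences to all_proteins.faa; grep -c '^>' all_proteins.faa then gives the number of protein records.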

written 2.4 years ago by Joe

Hi Joe, hope you are safe and well.

Why does this work:

ncbi-genome-download -n -s refseq bacteria --genera Zhihengliuella -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10

INFO: Using cached summary.
Considering the following 1 assemblies for download:
GCF_002848265.1 Zhihengliuella sp. ISTPL4   ISTPL4

And this script doesn't?

echo "Downloading genomes from NCBI"

input="bac_taxa.txt"

while IFS= read -r line
do
  mkdir $line
  cd $line
  echo "Downloading $line genomes from NCBI"
  ncbi-genome-download -n -s refseq bacteria --genera $line -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10

  cd ..
done < "$input"

Downloading Acidiphilium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipila genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipropionibacterium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisarcina genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisoma genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisphaera genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithiobacillus genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithrix genomes from NCBI
Unsupported assembly level: cds-fasta

My list of genera (example):

Acidovorax
Acinetobacter
Acrocarpospora
Actibacterium
Actinoallomurus
Actinoalloteichus
Actinobacillus
Actinobacteria
actinobacterium
Actinobaculum
Actinocatenispora
Actinocorallia
Actinocrispum
Actinokineospora
Actinomadura

I used it to download all the genomic FASTA files and it worked just fine! Thanks

written 5 weeks ago by psschlogl

I'm not at a computer to test this at the moment, but my guess would be that your loop isn't synthesising the command properly. It may be the quotes around cds-fasta. Check that your command is well formed, and introduce the flags one by one in the loop to narrow down the issue.
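
One way to do that check is to have the loop print the exact command it would run before running it. The sketch below is only an illustration: the thread does not record what the actual cause was. The flags are copied from the command in the post (note that -n is the tool's own dry-run flag, so remove it to actually download), with bacteria moved to the end. download_genus and SHOW_ONLY are hypothetical names, and stripping carriage returns guards against a list file saved with Windows line endings, which is an assumption and not something the post confirms.

```shell
# Print (default) or run the ncbi-genome-download call for one genus.
# The body runs in a subshell, so the cd does not leak into the calling loop.
download_genus() (
    genus=$(printf '%s' "$1" | tr -d '\r')   # drop a stray carriage return, if any
    [ -n "$genus" ] || exit 0                 # skip blank lines
    set -- ncbi-genome-download -n -s refseq -l complete,chromosome \
        --genera "$genus" -F cds-fasta --flat-output -p 4 -r 10 -v bacteria
    if [ "${SHOW_ONLY:-1}" = 1 ]; then
        printf '%s\n' "$*"                    # show exactly what would be run
    else
        mkdir -p "$genus" && cd "$genus" && "$@"
    fi
)

# Usage, assuming one genus per line in bac_taxa.txt:
#   while IFS= read -r line; do download_genus "$line"; done < bac_taxa.txt
# Set SHOW_ONLY=0 once the printed commands look right.
```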

written 5 weeks ago by Joe

Yeah that worked. Thanks man. Paulo

written 5 weeks ago by psschlogl