Question: How to download all protein sequences of a bacterium using the NCBI FTP site?
6 months ago, sharmatina189059 (United States) wrote:

How can I download all protein sequences for the complete genomes of Acinetobacter baumannii from the NCBI FTP site?

6 months ago, jrj.healey (United Kingdom) wrote:

See my answer here: A: How to extract Refseq of downloaded files from NCBI


OP needs to get the *protein.faa.gz files, since protein data is needed.

Reply written 6 months ago by genomax

OP, take a look at the help for ncbi-genome-download. Pass the option --format protein-fasta to get what you want.

(Or download the genome or CDS data and transform it yourself.)

Reply written 6 months ago by jrj.healey

I am running this command:

ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta

but this gives me MD5SUMS files with file names like this. I need FASTA sequences.

260ac38772d1f9d98641f03bc5b07596  ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110  ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9  ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8  ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723  ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85  ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f  ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905  ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53  ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98  ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac  ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418  ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1  ./annotation_hashes.txt
Reply written 6 months ago by sharmatina189059

The sequences should be included in the *.faa files.

Reply written 6 months ago by Sishuo Wang

The MD5SUMS file is always provided. Its entries correspond to the files available for each assembly; the files you need should be present in a folder named GCF_000....
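
Since the MD5SUMS file sits next to the downloaded data, it can also be used to confirm that each protein file arrived intact. The following is only a sketch under a few assumptions: GNU coreutils `md5sum` is installed, the layout is one folder per assembly with its own MD5SUMS file, and the function name `verify_protein_md5` is made up for illustration.

```shell
# Sketch: check every downloaded *_protein.faa.gz against the MD5SUMS
# file in the same assembly folder. Assumes GNU coreutils md5sum.
# verify_protein_md5 is a hypothetical helper name.
verify_protein_md5() {
    root="${1:-.}"
    status=0
    for sums in "$root"/*/MD5SUMS; do
        # Skip the unexpanded glob when no MD5SUMS files exist.
        [ -f "$sums" ] || continue
        dir=$(dirname "$sums")
        # Keep only the checksum line for the protein FASTA, if any.
        line=$(grep '_protein\.faa\.gz$' "$sums") || continue
        # Run the check from inside the folder, because MD5SUMS uses ./ paths.
        ( cd "$dir" && printf '%s\n' "$line" | md5sum -c --quiet - ) || status=1
    done
    return "$status"
}

# Usage: verify_protein_md5 refseq/bacteria
```

The function returns non-zero if any protein file is missing or fails its checksum. A folder whose MD5SUMS has no protein entry is skipped without complaint.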

Your command is also wrong. complete,chromosome is not a single argument to the --assembly-level option; you should specify one or the other. Similarly, bacteria is a positional argument and should come last in the command.

Make sure you read the documentation on the GitHub page.

Try:

ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

or

ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

I got:

$  ls /refseq/bacteria

GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz  MD5SUMS

GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz  MD5SUMS

GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz  MD5SUMS

GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz  MD5SUMS

GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz  MD5SUMS

GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz  MD5SUMS
...
...

You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards with something like:

find ./ -name "*.faa.gz"
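
Building on that find command, the protein files can be decompressed and merged into a single FASTA file in one pass. This is a rough sketch, not part of ncbi-genome-download: the function name `collect_proteins` and the default output name `all_proteins.faa` are placeholders, and it assumes the files follow the *_protein.faa.gz naming shown in the listing above.

```shell
# Sketch: merge every downloaded protein FASTA into one uncompressed file.
# collect_proteins and all_proteins.faa are hypothetical names.
collect_proteins() {
    root="${1:-.}"
    out="${2:-all_proteins.faa}"
    # find skips the empty folders on its own; gzip -dc writes the
    # decompressed sequences to stdout, which is redirected to $out.
    find "$root" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$out"
}

# Usage: collect_proteins refseq/bacteria all_proteins.faa
```

The order in which find visits the folders is not guaranteed, so the records in the merged file may not be sorted by assembly accession.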

Reply written 6 months ago by jrj.healey