Question: How to download all protein sequences of a bacterium using the NCBI FTP site?
14 months ago by
United States
sharmatina18905940 wrote:

How can I download all protein sequences for the complete genomes of Acinetobacter baumannii from the NCBI FTP site?

ncbi • 656 views
modified 14 months ago by Joe14k • written 14 months ago by sharmatina18905940
14 months ago by
Joe14k
United Kingdom
Joe14k wrote:

See my answer here: A: How to extract Refseq of downloaded files from NCBI

written 14 months ago by Joe14k

OP needs to get *protein.faa.gz files since protein data is needed.

modified 14 months ago • written 14 months ago by genomax74k

OP take a look at the help for ncbi-genome-download. Give the option --format protein-fasta to get what you want.

(or download the genome or CDS data and translate it yourself)

modified 14 months ago • written 14 months ago by Joe14k

I am running this command:

ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta

but this gives me an MD5SUMS file with file names like this. I need the FASTA sequences.

260ac38772d1f9d98641f03bc5b07596  ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110  ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9  ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8  ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723  ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85  ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f  ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905  ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53  ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98  ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac  ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418  ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1  ./annotation_hashes.txt
modified 14 months ago by Joe14k • written 14 months ago by sharmatina18905940

The protein sequences should be included in the *.faa files.

written 14 months ago by Sishuo Wang180

The MD5sums are always provided. They correspond to the files you need which should be present in a folder named GCF_000....
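
As a rough sketch of how the MD5SUMS file can be put to use, the downloaded protein file can be checked against its listed checksum. This assumes GNU coreutils `md5sum` is installed; the function name `verify_protein_faa` is made up for illustration and is not part of ncbi-genome-download.

```shell
# Hypothetical helper: check the protein FASTA in one assembly folder
# against the matching line of that folder's MD5SUMS file.
verify_protein_faa() {
    dir="$1"
    (
        cd "$dir" || exit 1
        # MD5SUMS lists every file of the assembly; keep only the protein
        # FASTA entry, since the other files were not downloaded.
        grep '_protein\.faa\.gz$' MD5SUMS | md5sum -c -
    )
}

# Example call (folder name taken from the listing in this thread):
# verify_protein_faa refseq/bacteria/GCF_000018445.1
```

The function returns a non-zero status when the checksum does not match or when MD5SUMS has no protein entry, so it can be used in a loop over all GCF_* folders.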

Your command is also wrong. complete,chromosome is not one argument to the --assembly-level option. You should specify one or the other. Similarly, bacteria is a positional argument and should come last in the command.

Make sure you read the documentation on the GitHub page.

Try:

ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

or

 ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

I got:

$  ls /refseq/bacteria

GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz  MD5SUMS

GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz  MD5SUMS

GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz  MD5SUMS

GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz  MD5SUMS

GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz  MD5SUMS

GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz  MD5SUMS
...
...

You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards with something like find ./ -name "*.faa.gz"
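
The find step can be extended to gather everything into one file. The following is only a sketch, assuming `gzip` is available and that a single combined multi-FASTA is what is wanted; `collect_proteins` and the output file name are hypothetical.

```shell
# Hypothetical helper: gather every downloaded protein FASTA under a
# directory into a single uncompressed multi-FASTA file.
collect_proteins() {
    src="$1"   # e.g. refseq/bacteria
    out="$2"   # e.g. all_proteins.faa
    # gzip -dc writes the decompressed sequences to stdout and leaves the
    # original .faa.gz files untouched. Empty folders are simply skipped.
    find "$src" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$out"
}

# Example call:
# collect_proteins refseq/bacteria all_proteins.faa
```

The files are concatenated in whatever order find returns them, so sort the headers afterwards if a stable order matters.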

modified 14 months ago • written 14 months ago by Joe14k