Question

Batch download assemblies based on accession number with a shell script


8.1 years ago

armedcoffe445 • 0

Hello,

Is there a reasonably fast way to download assemblies given only a list of accession numbers?

Could this be done with a shell script?

For example, given the accession number NC_002695 for Escherichia coli O157:H7 str. Sakai, I would like to download all the protein sequences for that organism.

genome • 1.9k views

updated 8.1 years ago by GenoMax • written 8.1 years ago by armedcoffe445

score 0 · Answer 1 · 2016-03-21


8.1 years ago

GenoMax 141k

No shell script needed if you need proteins from just this one genome.



Batch download as in I have more queries. Plural. Not one.

written 8.1 years ago by armedcoffe445


If you need the protein sequences, you can iterate over your list of accession numbers and build links in the same format as the link above. The "all" genomes directory is large and may time out in a browser. You will first have to map your accession numbers to GCF_* assembly numbers.

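A possible loop can be as simple as:

#!/bin/bash
# Each line of the input file is expected to hold one GCF_* assembly name.
while read -r i; do
    # Fetch the protein FASTA that sits in the directory of the same name.
    wget "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/${i}/${i}_protein.faa.gz"
done < your_accession_number_file.txt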

written 8.1 years ago by GenoMax


Great!

But is there a way to differentiate between representative and reference genomes?

written 8.1 years ago by armedcoffe445


RefSeq bacterial genomes are here: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/
The problem is that you will need to choose between multiple options once you are inside the directory of a genome. There is more than one version (e.g. ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/) and there are no protein sequences available in this hierarchy.
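
One way to separate reference from representative genomes without browsing the directories might be to filter a tab-separated summary table by its category column. The sketch below assumes a table shaped like NCBI's assembly_summary.txt, with the assembly accession in column 1, the RefSeq category in column 5, the organism name in column 8 and the FTP path in column 20. The column positions, the category strings and the function name filter_by_category are assumptions for illustration, so check them against the header of the file you actually download.

```shell
#!/bin/bash
# Print accession, organism name and FTP path for every row whose
# category column equals the requested value.
# ASSUMED layout: col 1 = assembly accession, col 5 = RefSeq category,
# col 8 = organism name, col 20 = FTP path. Verify against the file header.
filter_by_category() {
    local summary_file="$1" category="$2"
    awk -F '\t' -v want="$category" '
        /^#/ { next }                        # skip header/comment lines
        $5 == want { print $1 "\t" $8 "\t" $20 }
    ' "$summary_file"
}
```

Something like filter_by_category assembly_summary.txt "reference genome" would then list only the rows labelled that way, assuming that is the exact string used in the file.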

written 8.1 years ago by GenoMax


Okay. But how can I map accession numbers to GCF? That seems a bit tricky.

written 8.1 years ago by armedcoffe445


There is probably a file in there somewhere that has the info. I could not find it easily. If you have the species names then you could try file1 and file2 and get the GCF#.
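
If the species names are known, the lookup could be sketched roughly as follows. This assumes a tab-separated table shaped like NCBI's assembly_summary.txt (assembly accession in column 1, organism name in column 8, FTP path in column 20), and that the protein file is named after the last component of the FTP path. Both the column layout and the helper name gcf_for_organism are assumptions, not something confirmed in this thread.

```shell
#!/bin/bash
# For every row whose organism name contains the given text, print the
# GCF accession and a candidate URL for the protein FASTA.
# ASSUMED layout: col 1 = assembly accession, col 8 = organism name,
# col 20 = FTP path. Verify against the header of the real file.
gcf_for_organism() {
    local summary_file="$1" organism="$2"
    awk -F '\t' -v org="$organism" '
        /^#/ { next }                            # skip header/comment lines
        $20 != "na" && index($8, org) > 0 {
            n = split($20, parts, "/")           # last path component names the files
            print $1 "\t" $20 "/" parts[n] "_protein.faa.gz"
        }
    ' "$summary_file"
}
```

The second column of the output could be passed to wget in a loop. The match is a plain substring test, so a short name such as "Escherichia coli" will return every strain in the table.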

written 8.1 years ago by GenoMax


Thanks! I'll ask another question about where I can find that file. That would be extremely useful!

written 8.1 years ago by armedcoffe445