Batch download assemblies based on accession number with a shell script
8.1 years ago

Hello,

Is there a reasonably fast way to download assemblies when all I have is a list of accession numbers?

Could this be done with a shell script?

Basically, given an accession number such as NC_002695 for Escherichia coli O157:H7 str. Sakai, I want to download all the protein sequences for that organism.

genome • 1.9k views
8.1 years ago
GenoMax 141k

No shell script needed if you need proteins from just this one genome.


Batch download as in I have more queries. Plural. Not one.


If you need the protein sequences, you can iterate over your list of accession numbers and create links in the format of the link above. The "all" genomes directory is large and may time out in a browser. You will have to map your accession numbers to GCF* numbers.

A possible loop can be as simple as

#!/bin/bash

# Read one GCF* identifier per line and fetch the matching protein FASTA.
while read -r i; do
    wget "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/$i/${i}_protein.faa.gz"
done < your_accession_number_file.txt
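
If you want to check the links before downloading anything, here is a minimal sketch that only prints them. It assumes the same URL layout as the loop above, which may not match the current server layout; `protein_url` and `list_urls` are hypothetical helper names, not part of any NCBI tool.

```shell
#!/bin/bash
# Sketch only: the URL layout is copied from the loop above and is an
# assumption; check it against the server before relying on it.
base="ftp://ftp.ncbi.nlm.nih.gov/genomes/all"

# Build the protein FASTA URL for one identifier (hypothetical helper).
protein_url() {
    printf '%s/%s/%s_protein.faa.gz\n' "$base" "$1" "$1"
}

# Print one URL per non-empty line of the list file given as $1.
list_urls() {
    while IFS= read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue    # skip blank lines
        protein_url "$id"
    done < "$1"
}

# Usage: inspect the URLs first, then hand them to wget on stdin.
# list_urls your_accession_number_file.txt
# list_urls your_accession_number_file.txt | wget -i -
```

Printing first makes it easy to spot identifiers that were not mapped to GCF* numbers before any request is sent.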

Great!

But is there a way to differentiate between representative and reference genomes?


RefSeq bacterial genomes are here: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/
The problem is that you will need to choose between multiple options once you are inside the directory of a genome. There is more than one version (e.g. ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/), and no protein sequences are available in this hierarchy.


Okay. But how can I map accession numbers to GCF? That seems a bit tricky.


There is probably a file in there somewhere that has the info. I could not find it easily. If you have the species names then you could try file1 and file2 and get the GCF#.
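
A lookup by species name could be sketched roughly like this. It assumes you have saved a tab-separated summary table locally, with the assembly accession in column 1, a reference/representative category in column 5 and the organism name in column 8. Those column positions and the `find_gcf` name are assumptions for illustration, so check them against the header of whatever file you actually use.

```shell
#!/bin/bash
# Sketch only: look up GCF* accessions by organism name in a local
# tab-separated summary table. Column positions are assumptions:
#   1 = assembly accession, 5 = category, 8 = organism name.

# Usage: find_gcf "organism name substring" summary_table.tsv
find_gcf() {
    awk -F '\t' -v name="$1" '
        /^#/ { next }                                  # skip comment/header lines
        index($8, name) { print $1 "\t" $5 "\t" $8 }   # substring match on organism name
    ' "$2"
}

# Usage example (file name is a placeholder):
# find_gcf "Escherichia coli O157:H7 str. Sakai" summary_table.tsv
```

Printing the category column next to the accession also lets you tell reference and representative entries apart when several rows match the same species.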


Thanks! I'll ask another question about where I can find that file. That would be extremely useful!
