Question: Batch download assemblies based on accession number with shell script
4.3 years ago, armedcoffe445 wrote:

Hello,

is there a rather fast way to download assemblies by only having a list of accession numbers?

Could this be done with a shell script?

Basically, given an accession number such as NC_002695 for Escherichia coli O157:H7 str. Sakai, I want to download all the protein sequences for that organism.

genome • 1.0k views
4.3 years ago, genomax wrote:

No shell script needed if you need proteins from just this one genome.


Batch download as in I have more queries. Plural. Not one.

written 4.3 years ago by armedcoffe445

If you need the protein sequences, you can iterate over your list of accession numbers and build links in the same format as the link above. The "all" genomes directory is large and may time out in a browser. You will first have to map your accession numbers to GCF* numbers.

A possible loop can be as simple as

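#!/bin/bash
# Read one GCF* identifier per line and fetch the matching protein FASTA file.
while IFS= read -r i; do
    [ -z "$i" ] && continue   # skip blank lines
    wget "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/${i}/${i}_protein.faa.gz"
done < your_accession_number_file.txt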
written 4.3 years ago by genomax
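
Before fetching anything, it can help to print the URLs the loop would request, so that entries that are not GCF-style identifiers are caught early. This is only a minimal sketch, assuming one identifier per line and the same URL pattern as in the loop above. The function name `build_protein_urls` is hypothetical and not from the thread.

```shell
#!/bin/bash
# Hypothetical helper: print one download URL per identifier in the given file.
# Assumes the URL pattern used in the thread; nothing is checked against the server.
build_protein_urls() {
    local list_file="$1" id
    while IFS= read -r id || [ -n "$id" ]; do
        # Skip blank lines.
        [ -z "$id" ] && continue
        # Report (on stderr) and skip entries that do not look like GCF identifiers.
        case "$id" in
            GCF_*) ;;
            *) echo "skipping non-GCF entry: $id" >&2; continue ;;
        esac
        printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s_protein.faa.gz\n' "$id" "$id"
    done < "$list_file"
}
```

The output can be inspected first and then handed to wget, which reads URLs from standard input with `-i -`: `build_protein_urls your_accession_number_file.txt | wget -i -`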

Great!

But can there be a way to differentiate between representative and reference genomes?

written 4.3 years ago by armedcoffe445

RefSeq bacterial genomes are here: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/
The problem is that you will need to choose between multiple options once you are inside a genome's directory. There is more than one version (e.g. ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/), and there are no protein sequences available in this hierarchy.

written 4.3 years ago by genomax
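
One possible way to separate reference from representative genomes, sketched here rather than taken from the thread, is to filter NCBI's tab-delimited `assembly_summary.txt` on its `refseq_category` column. The sketch assumes a local copy of that file with the assembly accession in column 1, `refseq_category` in column 5 and the organism name in column 8; those positions should be checked against the file's own header line. The function name `list_by_category` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list assemblies whose refseq_category equals the given value.
# Assumes a local, tab-delimited assembly_summary.txt with the assembly accession in
# column 1, refseq_category in column 5 and organism_name in column 8.
list_by_category() {
    local summary_file="$1" category="$2"
    awk -F '\t' -v cat="$category" '
        /^#/ { next }                      # skip header and comment lines
        $5 == cat { print $1 "\t" $8 }     # print accession and organism name
    ' "$summary_file"
}
```

For example, `list_by_category assembly_summary.txt "reference genome"` would print the rows marked as reference genomes, and passing `"representative genome"` would print the representative ones.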

Okay. But how can I map accession numbers to GCF? That seems a bit tricky.

written 4.3 years ago by armedcoffe445

There is probably a file in there somewhere that has the info. I could not find it easily. If you have the species names then you could try file1 and file2 and get the GCF#.

modified 4.3 years ago • written 4.3 years ago by genomax
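
If species names are available, the lookup could be sketched roughly as follows. This assumes a local copy of NCBI's tab-delimited `assembly_summary.txt`, with the assembly accession (the GCF number) in column 1, the organism name in column 8 and the FTP path in column 20. Those column positions are an assumption to verify against the file's header, and the function name `gcf_for_organism` is hypothetical. Note that this matches on organism name, not on NC_* accession numbers.

```shell
#!/bin/bash
# Hypothetical helper: print the GCF accession and FTP path of every assembly whose
# organism name starts with the given string.
# Assumes a local, tab-delimited assembly_summary.txt with the accession in column 1,
# organism_name in column 8 and ftp_path in column 20.
gcf_for_organism() {
    local summary_file="$1" organism="$2"
    awk -F '\t' -v name="$organism" '
        /^#/ { next }                                  # skip header and comment lines
        index($8, name) == 1 { print $1 "\t" $20 }     # organism name begins with query
    ' "$summary_file"
}
```

A call such as `gcf_for_organism assembly_summary.txt "Escherichia coli O157:H7 str. Sakai"` would then list the candidate GCF numbers; a species can have many assemblies, so the output may still need narrowing by hand.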

Thanks! I'll ask another question about where I can find that file. That would be extremely useful!

written 4.3 years ago by armedcoffe445