8.2 years ago
armedcoffe445
Hello,
Is there a reasonably fast way to download assemblies given only a list of accession numbers?
Could this be done with a shell script?
Basically, given an accession number such as NC_002695 for Escherichia coli O157:H7 str. Sakai, I would like to download all the protein sequences for that organism.
By batch download I mean that I have many queries. Plural. Not one.
If you need the protein sequences, you can iterate over your list of accession numbers and create links in the format of the link above. The "all" genomes directory is large and may time out in a browser. You will have to map your accession numbers to GCF_* assembly accessions first.
A possible loop can be as simple as
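As a minimal sketch of such a loop in Python: it assumes you already know the FTP directory of each assembly, and that the protein FASTA in that directory is named after the directory itself with a `_protein.faa.gz` suffix, which is the usual RefSeq layout. The directory in `EXAMPLE_DIRS` is a made-up placeholder, not a real assembly, and the helper names are hypothetical.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlsplit


def protein_faa_url(assembly_dir_url):
    """Build the protein FASTA URL for one assembly directory.

    Assumes the file is named <directory name>_protein.faa.gz.
    """
    clean = assembly_dir_url.rstrip("/")
    basename = posixpath.basename(urlsplit(clean).path)
    return f"{clean}/{basename}_protein.faa.gz"


def download_proteins(assembly_dir_urls, out_dir="."):
    """Download the protein FASTA of every assembly directory in the list."""
    for dir_url in assembly_dir_urls:
        url = protein_faa_url(dir_url)
        target = os.path.join(out_dir, posixpath.basename(url))
        urllib.request.urlretrieve(url, target)  # needs network access


# Placeholder directory for illustration only (not a real assembly).
EXAMPLE_DIRS = [
    "ftp://ftp.ncbi.nih.gov/genomes/all/GCF/000/000/000/GCF_000000000.1_ExampleAsm/",
]

for d in EXAMPLE_DIRS:
    print(protein_faa_url(d))
```

Calling `download_proteins(your_dirs)` with real directories would fetch the files one by one; the loop above only prints the URLs so that you can check them first.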
Great!
But can there be a way to differentiate between representative and reference genomes?
RefSeq bacterial genomes are here: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/
The problem is that you will need to choose between multiple options once you are inside the directory of a genome. There is more than one version (e.g. ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/) and there are no protein sequences available in this hierarchy.
Okay. But how can I map accession numbers to GCF? That seems a bit tricky.
There is probably a file in there somewhere that has the info. I could not find it easily. If you have the species names then you could try file1 and file2 and get the GCF#.
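One candidate for such a file is the tab-separated `assembly_summary.txt` that NCBI publishes in the RefSeq directories (for bacteria, under ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/). The sketch below assumes its header line names the columns `assembly_accession`, `refseq_category`, `organism_name` and `ftp_path`, and that `refseq_category` holds values such as `reference genome`, `representative genome` or `na`. The sample rows and helper names are invented for illustration. Note that this looks assemblies up by organism name, not by a nucleotide accession such as NC_002695.

```python
def read_assembly_summary(lines):
    """Parse assembly_summary-style lines into a list of dicts."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The column header is a comment line containing the field names.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def select_assemblies(rows, organism,
                      categories=("reference genome", "representative genome")):
    """Keep rows whose organism name contains `organism` (case-insensitive)
    and whose refseq_category is one of `categories`."""
    wanted = organism.lower()
    return [
        r for r in rows
        if wanted in r.get("organism_name", "").lower()
        and r.get("refseq_category") in categories
    ]


# Invented sample data with only the four columns used here.
SAMPLE = [
    "#   See the README for a description of the columns\n",
    "# assembly_accession\trefseq_category\torganism_name\tftp_path\n",
    "GCF_000000001.1\treference genome\tExamplea coli str. A\tftp://example.org/GCF_000000001.1_AsmA\n",
    "GCF_000000002.1\tna\tExamplea coli str. B\tftp://example.org/GCF_000000002.1_AsmB\n",
    "GCF_000000003.1\trepresentative genome\tOther species\tftp://example.org/GCF_000000003.1_AsmC\n",
]

rows = read_assembly_summary(SAMPLE)
for r in select_assemblies(rows, "examplea coli"):
    print(r["assembly_accession"], r["refseq_category"], r["ftp_path"])
```

With the real file you would pass an open file handle instead of `SAMPLE`. Passing `categories=("reference genome",)` restricts the result to reference genomes only, which is one way to tell them apart from representative genomes.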
Thanks! I'll ask another question about where I can find that file. That would be extremely useful!