I need to know the contents of the latest_assembly_versions directories for each species here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/
In other words, I need to ls every file matching this pattern: bacteria/*/latest_assembly_versions/GCA* and write the results to a file.
I am going to use this information to organize the genomes once I have downloaded them all, so that I can replicate the way NCBI organizes the genomes in the respective species folders.
Currently, I am running this for each species, but it is extremely inefficient, taking several hours to complete:
    import os
    from ftplib import error_temp

    # `ftp` (a connected ftplib.FTP instance) and `info_dir` (the output
    # directory) are defined elsewhere in the script.
    def get_latest_assembly_versions(genbank_mirror, species):
        latest_dir = os.path.join(species, "latest_assembly_versions")
        latest_assembly_versions = os.path.join(info_dir, "latest_assembly_versions.csv")
        try:
            complete_ids = [complete_id.split("/")[-1] for complete_id in ftp.nlst(latest_dir)]
        except error_temp:
            # The listing failed for this species, so skip it.
            return
        print(species, len(complete_ids))
        with open(latest_assembly_versions, "a") as f:
            for complete_id in complete_ids:
                # e.g. GCA_XXXXXXXXX.N_asmname -> GCA_XXXXXXXXX.N
                short_id = "_".join(complete_id.split("_")[:2])
                f.write("{},{},{}\n".format(species, short_id, complete_id))
Suggestions greatly appreciated. Thanks!
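One possible restructuring, offered as a sketch rather than a tested fix: collect the rows for all species over a single connection and write the CSV once, instead of reopening the file for every species. The names `collect_latest_rows`, `write_rows` and `species_dirs` are hypothetical, and `ftp` is assumed to be an already connected `ftplib.FTP` whose working directory is genomes/genbank/bacteria. This still issues one listing request per species, so it may not remove the main cost.

```python
import csv
from ftplib import error_temp


def collect_latest_rows(ftp, species_dirs):
    """Yield (species, short_id, complete_id) for every listed assembly."""
    for species in species_dirs:
        latest_dir = species + "/latest_assembly_versions"
        try:
            names = ftp.nlst(latest_dir)
        except error_temp:
            # Temporary (4xx) server reply: skip this species and carry on.
            continue
        for name in names:
            complete_id = name.split("/")[-1]
            short_id = "_".join(complete_id.split("_")[:2])
            yield species, short_id, complete_id


def write_rows(rows, out_path):
    """Write all rows in one pass so the output file is opened only once."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Usage would look something like `write_rows(collect_latest_rows(ftp, species_dirs), "latest_assembly_versions.csv")`, where `species_dirs` comes from a single listing of the bacteria directory.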
This file has all the paths you need.
I do not need the paths; thankfully, that information is already in the file you pointed to. Although, if you want only the ftp paths for bacteria, as I do, you will want to use this. What I need to know is where to put those genomes once I download them, so that I can mirror the directory structure NCBI has. The paths in the 'ftp_path' column of assembly_summary.txt are symbolically linked into the directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*/latest_assembly_versions
This information is not contained in assembly_summary.txt.
That file has a column 'organism_name' which contains 41045 unique values, whereas there are only 19321 species directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*
I need the mapping of the 85908 genomes listed in assembly_summary.txt to their respective species directories, of which there are 19321.
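To make the mapping concrete, here is a minimal sketch of the join I have in mind once the listing exists. It assumes (I have not verified this for every entry) that the last component of each 'ftp_path' is the same name that appears under latest_assembly_versions. The function name `map_paths_to_species` and the row layout (species, short_id, complete_id) are hypothetical.

```python
import posixpath


def map_paths_to_species(ftp_paths, listing_rows):
    """Return {ftp_path: species_dir}, joining on the assembly directory name."""
    species_by_id = {
        complete_id: species for species, _short_id, complete_id in listing_rows
    }
    mapping = {}
    for path in ftp_paths:
        # Assumption: the basename of ftp_path equals the linked directory name.
        complete_id = posixpath.basename(path.rstrip("/"))
        if complete_id in species_by_id:
            mapping[path] = species_by_id[complete_id]
    return mapping
```

Any path that does not show up in the result would be a genome with no matching entry in the listing, which is worth checking separately.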
Something that might make this tricky is the fact that the files in the latest_assembly_versions directories are symbolic links.
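If the server returns Unix-style long listings (an assumption; the format of LIST output is not standardized), symbolic links show up as lines starting with "l" and ending in "name -> target". A rough sketch of pulling those two pieces out of one line follows; the helper name `parse_symlink` is hypothetical, and it will not handle link names containing spaces. Lines could be gathered with `ftp.retrlines("LIST " + latest_dir, lines.append)`.

```python
def parse_symlink(list_line):
    """Return (link_name, target) for a Unix-style symlink entry, else None."""
    if not list_line.startswith("l") or " -> " not in list_line:
        return None
    left, target = list_line.rsplit(" -> ", 1)
    # The link name is the last whitespace-separated field before the arrow.
    link_name = left.split()[-1]
    return link_name, target
```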