Retrieve directory structure of ftp.ncbi - python or bash
7.4 years ago

I need to know the contents of the latest_assembly_versions directories for each species here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/

In other words, I need to ls every file matching this pattern: bacteria/*/latest_assembly_versions/GCA* and write the results to a file.

I am going to use this information to organize the genomes once I download them all in order to replicate the way they have the genomes organized in the respective species folders.

Currently, I am running this for each species, but it is extremely inefficient, taking several hours to complete:

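import os
from ftplib import error_temp

# Assumes `ftp` (a connected, logged-in ftplib.FTP instance, presumably with
# genomes/genbank/bacteria as its working directory) and `info_dir` are
# defined at module level.
def get_latest_assembly_versions(genbank_mirror, species):

    latest_dir = os.path.join(species, "latest_assembly_versions")
    latest_assembly_versions = os.path.join(info_dir, "latest_assembly_versions.csv")
    try:
        # Keep only the last path component, e.g. GCA_003355155.1_ASM335515v1
        complete_ids = [complete_id.split("/")[-1] for complete_id in ftp.nlst(latest_dir)]
        print(species, len(complete_ids))
        # Short ID is the accession without the assembly name, e.g. GCA_003355155.1
        short_ids = ["_".join(accession_id.split("_")[:2]) for accession_id in complete_ids]
        with open(latest_assembly_versions, "a") as f:
            for complete_id, short_id in zip(complete_ids, short_ids):
                f.write("{},{},{}\n".format(species, short_id, complete_id))
    except error_temp:
        # Listing failed for this species; skip it. (The original used
        # `continue` here, which is a syntax error outside a loop.)
        return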

Suggestions greatly appreciated. Thanks!

python bash • 1.7k views

This file has all the paths you need.


The paths themselves are not what I need; thankfully, that info is in the file you pointed to. (Although, if you want only the ftp paths for bacteria, as I do, you will want to use this.) What I need to know is where to put those genomes once I download them, so that I can mirror the directory structure NCBI has. The paths in the 'ftp_path' column of assembly_summary.txt are symbolically linked into the directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*/latest_assembly_versions

This info is not contained in assembly_summary.txt.

That file has a column 'organism_names' which contains 41045 unique values, whereas there are only 19321 species directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*

I need the mapping of the 85908 genomes contained in assembly_summary.txt to their respective species directories, of which there are 19321, as reflected here.
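
The mapping described above could be built by joining on the assembly directory name, once a listing of the species directories is available. The sketch below is only an illustration: the function name `map_assemblies_to_species` and its arguments are hypothetical, and it assumes that the last component of each 'ftp_path' value is identical to the name of the link found under `<species>/latest_assembly_versions/`.

```python
import posixpath


def map_assemblies_to_species(ftp_paths, listing):
    """Map each ftp_path to the species directory that links to it.

    ftp_paths: iterable of 'ftp_path' values from assembly_summary.txt
    listing:   iterable of (species, complete_id) pairs, one per entry
               found under <species>/latest_assembly_versions/
    Returns (mapping, unmatched), where mapping is {ftp_path: species}.
    """
    # Assumption: link names are unique across species directories.
    species_by_id = {complete_id: species for species, complete_id in listing}
    mapping = {}
    unmatched = []
    for ftp_path in ftp_paths:
        # Assumption: the final path component is the assembly directory name.
        complete_id = posixpath.basename(ftp_path.rstrip("/"))
        species = species_by_id.get(complete_id)
        if species is None:
            unmatched.append(ftp_path)
        else:
            mapping[ftp_path] = species
    return mapping, unmatched
```

Anything returned in `unmatched` has no link in the listing, which is worth checking by hand before moving files around.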


Something that might make this tricky is the fact that the files in the latest_assembly_versions directories are symbolic links.

5.7 years ago

Maybe you could loop through the output of the following command:

$ curl -s --ftp-method nocwd --list-only 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*/latest_assembly_versions/' | awk '{print "ftp://ftp.ncbi.nlm.nih.gov/"$1}' > list.txt

Which looks something like this:

$ head list.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Streptomyces_armeniacus/latest_assembly_versions/GCA_003355155.1_ASM335515v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Bacillus_sp._COPE52/latest_assembly_versions/GCA_003355115.1_ASM335511v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/_Flavobacterium_thermophilum/latest_assembly_versions/GCA_900450595.1_51354_H01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Phocoenobacter_uteri/latest_assembly_versions/GCA_900454895.1_51184_D02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Veillonella_criceti/latest_assembly_versions/GCA_900460315.1_51395_D01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Legionella_busanensis/latest_assembly_versions/GCA_900461525.1_50618_A02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Legionella_beliardensis/latest_assembly_versions/GCA_900452395.1_50618_B01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Kingella_potus/latest_assembly_versions/GCA_900451175.1_50465_B02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Iodobacter_fluviatilis/latest_assembly_versions/GCA_900451195.1_48853_F02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Mycolicibacterium_aichiense/latest_assembly_versions/GCA_900453085.1_49677_D01
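
Lines in this format can be turned into the species, short ID, complete ID rows that the question writes to its CSV. Here is a minimal sketch, assuming every line ends in `<species>/latest_assembly_versions/<complete_id>` as in the sample above; the function names `parse_listing_line` and `listing_to_csv` are made up for illustration.

```python
import csv
from urllib.parse import urlparse


def parse_listing_line(line):
    """Split one listing URL into (species, short_id, complete_id)."""
    parts = urlparse(line.strip()).path.strip("/").split("/")
    # Expected tail: <species>/latest_assembly_versions/<complete_id>
    if len(parts) < 3 or parts[-2] != "latest_assembly_versions":
        raise ValueError("unexpected listing line: {!r}".format(line))
    species, complete_id = parts[-3], parts[-1]
    # "GCA_003355155.1_ASM335515v1" -> "GCA_003355155.1"
    short_id = "_".join(complete_id.split("_")[:2])
    return species, short_id, complete_id


def listing_to_csv(lines, out):
    """Write one CSV row per non-empty listing line to the file object `out`."""
    writer = csv.writer(out)
    for line in lines:
        if line.strip():
            writer.writerow(parse_listing_line(line))
```

For example, `with open("list.txt") as src, open("latest_assembly_versions.csv", "w", newline="") as dst: listing_to_csv(src, dst)` would convert the whole file in one pass.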

Or perhaps you could use the curl command above as a start for what you really want to do.
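
If you would rather stay in Python, the same idea might carry over to ftplib: send the wildcard pattern in a single NLST request instead of one request per species. This is only a sketch. It assumes the server expands the `*` itself, which is what the curl command above appears to rely on, and the names `HOST`, `PATTERN`, `list_latest_assemblies` and `main` are made up for illustration.

```python
from ftplib import FTP, error_perm, error_temp

HOST = "ftp.ncbi.nlm.nih.gov"
# Assumption: the server expands the wildcard in the NLST argument.
PATTERN = "genomes/genbank/bacteria/*/latest_assembly_versions/"


def list_latest_assemblies(ftp, pattern=PATTERN):
    """Return every path the server reports for `pattern` in one NLST call."""
    try:
        return ftp.nlst(pattern)
    except (error_temp, error_perm):
        # The server refused the listing or found nothing to list.
        return []


def main():
    """Connect anonymously and print one path per line."""
    with FTP(HOST) as ftp:
        ftp.login()  # anonymous login
        for path in list_latest_assemblies(ftp):
            print(path)
```

Calling `main()` performs the network request. The listing function only needs an object with an `nlst` method, so it can be tried against a stub first.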
