Question

Retrieve directory structure of ftp.ncbi - python or bash

7.4 years ago

andrewsanchez ▴ 10

I need to know the contents of the latest_assembly_versions directories for each species here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/

In other words, I need to ls every file matching this pattern: bacteria/*/latest_assembly_versions/GCA* and write the results to a file.

Once I have downloaded all the genomes, I will use this information to organize them the same way NCBI organizes them in the respective species folders.

Currently, I run the function below once per species, but it is extremely inefficient and takes several hours to complete:

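import os
from ftplib import error_temp

# Assumes `ftp` (a connected ftplib.FTP, presumably already in genomes/genbank/bacteria)
# and `info_dir` (a local output directory) are defined elsewhere.
def get_latest_assembly_versions(genbank_mirror, species):

      latest_dir = os.path.join(species, "latest_assembly_versions")
      latest_assembly_versions = os.path.join(info_dir, "latest_assembly_versions.csv")
      try:
          # One NLST round trip per species directory
          complete_ids = [complete_id.split("/")[-1] for complete_id in ftp.nlst(latest_dir)]
          print(species, len(complete_ids))
          # Short ID = first two "_"-separated fields of the complete ID
          short_ids = ["_".join(accession_id.split("_")[:2]) for accession_id in complete_ids]
          with open(latest_assembly_versions, "a") as f:
              for complete_id, short_id in zip(complete_ids, short_ids):
                  f.write("{},{},{}\n".format(species, short_id, complete_id))
      except error_temp:
          # Temporary FTP error (4xx reply): skip this species.
          # `continue` is only valid inside a loop, so return instead.
          return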

Suggestions greatly appreciated. Thanks!

python bash • 1.7k views

updated 5.7 years ago by Alex Reynolds 35k • written 7.4 years ago by andrewsanchez ▴ 10

This file (assembly_summary.txt) has all the paths you need.

7.4 years ago by GenoMax 142k

I do not need the paths. Thankfully, that info is in the file you pointed to. (Although, if you want only the FTP paths for bacteria, as I do, you will want to use this.) What I need to know is where to put those genomes once I download them, so that I can mirror the directory structure NCBI has. The paths in the 'ftp_path' column of assembly_summary.txt are symbolically linked into the directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*/latest_assembly_versions

This info is not contained in assembly_summary.txt.

That file has a column 'organism_name' which contains 41045 unique values, whereas there are only 19321 species directories here: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*

I need the mapping of the 85908 genomes contained in assembly_summary.txt to their respective species directories, of which there are 19321, as reflected in that listing.

7.4 years ago by andrewsanchez ▴ 10

Something that might make this tricky is the fact that the files in the latest_assembly_versions directories are symbolic links.

7.4 years ago by andrewsanchez ▴ 10

score 1 · Answer 1 · 2018-08-06

Maybe you could loop through the output of the following command:

$ curl -s --ftp-method nocwd --list-only 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/*/latest_assembly_versions/' | awk '{print "ftp://ftp.ncbi.nlm.nih.gov/"$1}' > list.txt

Which looks something like this:

$ head list.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Streptomyces_armeniacus/latest_assembly_versions/GCA_003355155.1_ASM335515v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Bacillus_sp._COPE52/latest_assembly_versions/GCA_003355115.1_ASM335511v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/_Flavobacterium_thermophilum/latest_assembly_versions/GCA_900450595.1_51354_H01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Phocoenobacter_uteri/latest_assembly_versions/GCA_900454895.1_51184_D02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Veillonella_criceti/latest_assembly_versions/GCA_900460315.1_51395_D01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Legionella_busanensis/latest_assembly_versions/GCA_900461525.1_50618_A02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Legionella_beliardensis/latest_assembly_versions/GCA_900452395.1_50618_B01
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Kingella_potus/latest_assembly_versions/GCA_900451175.1_50465_B02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Iodobacter_fluviatilis/latest_assembly_versions/GCA_900451195.1_48853_F02
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Mycolicibacterium_aichiense/latest_assembly_versions/GCA_900453085.1_49677_D01

Or perhaps you could use the curl command above as a start for what you really want to do.
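
For instance, here is a rough Python sketch of that loop. It assumes list.txt contains one URL per line in the form shown above (.../bacteria/<species>/latest_assembly_versions/<GCA_...>) and turns each line into the species, short ID, complete ID row that the function in the question writes. The names parse_listing_line and listing_to_csv are made up for illustration; they are not part of any library.

```python
import csv


def parse_listing_line(line):
    """Split one listing URL into (species, short_id, complete_id).

    Returns None for lines that do not end in
    <species>/latest_assembly_versions/<GCA_...>.
    """
    parts = line.strip().rstrip("/").split("/")
    if len(parts) < 3 or parts[-2] != "latest_assembly_versions":
        return None
    species, complete_id = parts[-3], parts[-1]
    if not complete_id.startswith("GCA"):
        return None
    # Short ID = first two "_"-separated fields, as in the question's code
    short_id = "_".join(complete_id.split("_")[:2])
    return species, short_id, complete_id


def listing_to_csv(lines, out):
    """Write one CSV row per valid listing line; return the number of rows."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        row = parse_listing_line(line)
        if row is not None:
            writer.writerow(row)
            count += 1
    return count
```

With list.txt in hand, passing the open file to listing_to_csv along with an output file opened for writing should produce the whole mapping in one pass, with no further FTP requests. The species column is the directory name to mirror locally.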