Question: Rsync and python for ftp.ncbi
andrewsanchez (Arizona) wrote 2.8 years ago:

I'm trying to call rsync within a Python loop to get files from NCBI.

After reading the man page on filter rules and this Stack Overflow question, http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files, I still don't understand why the code below doesn't work.

import os
import subprocess
from ftplib import FTP

ftp_site = 'ftp.ncbi.nlm.nih.gov'
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('genomes/genbank/bacteria')
dirs = ftp.nlst()

for organism in dirs:
    latest = os.path.join(organism, "latest_assembly_versions")
    for path in ftp.nlst(latest):
        accession = path.split("/")[-1]
        fasta = accession+"_genomic.fna.gz"
        subprocess.call(['rsync',
                         '--recursive',
                         '--copy-links',
                         #'--dry-run',
                         '-f=+ '+accession+'/*',
                         '-f=+ '+fasta,
                         '-f=- *',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest,
                         '--log-file=scratch/test_dir/log.txt',
                         'scratch/' + organism])

Instead of '-f=- *', I also tried '--exclude=*[^'+fasta+']' to exclude every file that doesn't match fasta.

For each directory path under latest, I want only the file whose name matches fasta exactly. There will always be exactly one such file in the directory latest/path.

EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.

Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run", which won't download anything onto your machine: http://pastebin.com/0reVKMCg

It gets EVERYTHING under 'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want. If I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files

In my script above, the variable dirs holds a list of all the organisms you will see at ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ and each one of those directories has a subdirectory latest_assembly_versions/, the contents of which I am looping through.
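One possible reading of the behaviour described above, offered as a sketch and not a confirmed fix: rsync checks its filter rules against every path from the top of the transfer downward, so a directory that matches '- *' is excluded before any rule for a file inside it is consulted. The helper below builds an argument list that includes each accession directory explicitly, then the one FASTA file inside it, and only then excludes everything else. The function name build_rsync_args, the dest_root parameter, the trailing slash on the source, and the use of --prune-empty-dirs are assumptions that do not appear in the original script.

```python
def build_rsync_args(organism, accessions, dest_root="scratch", dry_run=True):
    """Build an rsync command that copies only <acc>/<acc>_genomic.fna.gz.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Trailing slash (an assumption): transfer the *contents* of
    # latest_assembly_versions, so anchored rules start at that level.
    source = ("ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/"
              + organism + "/latest_assembly_versions/")

    args = ["rsync", "--recursive", "--copy-links", "--prune-empty-dirs"]
    if dry_run:
        args.append("--dry-run")

    for acc in accessions:
        # Include the directory itself, otherwise the final exclude hides it
        # and rsync never looks inside.
        args.append("--include=/" + acc + "/")
        # Include exactly one file inside that directory.
        args.append("--include=/" + acc + "/" + acc + "_genomic.fna.gz")

    # Rules are matched in order, so this catch-all must come last.
    args.append("--exclude=*")

    args += [source, dest_root + "/" + organism + "/"]
    return args
```

Inside the original loop, this would be called once per organism with the directory names taken from ftp.nlst(latest), for example subprocess.call(build_rsync_args(organism, accessions)). Leaving dry_run=True keeps rsync from downloading anything while the rules are being checked.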


I think you are falling into the trap of the XY problem. Please tell us more about what you are trying to do. Even without the details of what you really want to do, I doubt you need to get Python involved here; most probably rsync by itself can do it.

Comment written 2.8 years ago by Carlos Borroto