Rsync and python for ftp.ncbi
7.8 years ago

I'm trying to call rsync within a python loop to get files from NCBI.

I have read the filter rules section of the man page and this question: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files, but I still don't understand why the code below doesn't work.

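from ftplib import FTP
import os
import subprocess

# Log in anonymously and list every organism directory
ftp_site = 'ftp.ncbi.nlm.nih.gov'
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('genomes/genbank/bacteria')
dirs = ftp.nlst()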

for organism in dirs:
    latest = os.path.join(organism, "latest_assembly_versions")
    for path in ftp.nlst(latest):
        accession = path.split("/")[-1]
        fasta = accession+"_genomic.fna.gz"
        subprocess.call(['rsync',
                         '--recursive',
                         '--copy-links',
                         #'--dry-run',
                         '-f=+ '+accession+'/*',
                         '-f=+ '+fasta,
                         '-f=- *',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest,
                         '--log-file=scratch/test_dir/log.txt',
                         'scratch/' + organism])

Instead of '-f=- *', I also tried '--exclude=*[^'+fasta+']' to exclude every file that doesn't match fasta.

For each directory path within latest/*, I want only the file whose name matches fasta exactly. There will always be exactly one such file in the directory latest/path.
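One possible way to express that goal is sketched below. It is not a confirmed fix, only an illustration under stated assumptions. The helper name build_rsync_cmd and the organism name in the usage line are hypothetical. The sketch assumes that a trailing slash on the source keeps the top-level directory name out of the filter checks, that '--include=*/' is needed so rsync descends into each accession directory before the final '--exclude=*' applies, and that sibling files ending in _from_genomic.fna.gz may exist and have to be excluded first. It only builds and prints the argument list; nothing is downloaded.

```python
def build_rsync_cmd(organism, dest_root="scratch", dry_run=True):
    """Build an rsync argument list that keeps only <accession>_genomic.fna.gz files.

    Hypothetical helper for illustration; it does not run rsync itself.
    """
    # Trailing slash: transfer the directory's contents, not the directory name.
    src = ("ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/"
           + organism + "/latest_assembly_versions/")
    cmd = ["rsync", "--recursive", "--copy-links", "--prune-empty-dirs"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [
        # Rules are checked in order and the first match wins.
        "--exclude=*_from_genomic.fna.gz",  # assumed sibling files, dropped first
        "--include=*/",                     # let rsync descend into every directory
        "--include=*_genomic.fna.gz",       # the wanted FASTA file
        "--exclude=*",                      # everything else
        src,
        dest_root + "/" + organism + "/",
    ]
    return cmd


if __name__ == "__main__":
    # "Escherichia_coli" is a placeholder organism name.
    print(" ".join(build_rsync_cmd("Escherichia_coli")))
```

The returned list can be passed to subprocess.call once per organism, so the inner loop over accessions would no longer be needed. Keeping dry_run=True while checking the output avoids downloading anything.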

EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.

Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run," which won't download anything onto your machine: http://pastebin.com/0reVKMCg

It gets EVERYTHING under ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want. If I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files

In my script above, the variable dirs holds a list of all the organisms you will see at ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ and each one of those directories has a subdirectory latest_assembly_versions/, the contents of which I am looping through.

rsync ftp ncbi • 3.8k views

I think you are falling into the trap of the XY problem. Please tell us more about what you are trying to do. Even without the details of what you really want, I doubt you need to get Python involved here; most likely rsync can do it by itself.
