I'm trying to call rsync within a Python loop to get files from NCBI.
I have read the man page section on filter rules and the question here: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files
but I still don't understand why the code below doesn't work.

    import os
    import subprocess
    from ftplib import FTP

    ftp_site = 'ftp.ncbi.nlm.nih.gov'
    ftp = FTP(ftp_site)
    ftp.login()
    ftp.cwd('genomes/genbank/bacteria')
    dirs = ftp.nlst()  # one entry per organism directory

    for organism in dirs:
        latest = os.path.join(organism, "latest_assembly_versions")
        for path in ftp.nlst(latest):
            accession = path.split("/")[-1]
            fasta = accession+"_genomic.fna.gz"  # the only file I want
            subprocess.call(['rsync',
                             '--recursive',
                             '--copy-links',
                             #'--dry-run',
                             '-f=+ '+accession+'/*',
                             '-f=+ '+fasta,
                             '-f=- *',
                             'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest,
                             '--log-file=scratch/test_dir/log.txt',
                             'scratch/' + organism])

I also tried '--exclude=*[^'+fasta+']' in place of the filter rules above, to try to exclude files that don't match fasta.
For each directory latest/*, I want only the file that matches fasta exactly. There will always be exactly one file named fasta in that directory.
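To make the goal concrete, here is a minimal sketch of one way the "keep only fasta" intent could be expressed with rsync's long-form --include/--exclude options. It relies on the man page's description that the first matching rule wins and that rsync does not descend into an excluded directory, hence the '*/' include placed first. The helper name build_rsync_cmd, the dest_root parameter, and the organism/accession values in the usage lines are hypothetical, and the command has not been verified against the NCBI server.

```python
def build_rsync_cmd(organism, accession, dest_root="scratch", dry_run=True):
    """Build an rsync argument list meant to keep only <accession>_genomic.fna.gz.

    Hypothetical helper for illustration; it only builds the list and
    does not run rsync.
    """
    latest = organism + "/latest_assembly_versions"
    fasta = accession + "_genomic.fna.gz"

    cmd = ["rsync", "--recursive", "--copy-links", "--prune-empty-dirs"]
    if dry_run:
        cmd.append("--dry-run")

    cmd += [
        # Rules are checked in order and the first match wins.
        "--include=*/",        # let rsync traverse every directory
        "--include=" + fasta,  # keep the one file of interest
        "--exclude=*",         # drop everything else
        "ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/" + latest,
        dest_root + "/" + organism,
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder names, not real NCBI entries.
    print(build_rsync_cmd("Example_organism", "GCA_000000000.1_example"))
```

The returned list can be passed directly to subprocess.call(). --prune-empty-dirs is included so that directories left empty by the filters are not created on the receiving side.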
EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.
Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run," which won't download anything onto your machine: http://pastebin.com/0reVKMCg
As written, it gets EVERYTHING under 'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want. If I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files
In my script above, the variable dirs holds a list of all the organisms you will see at ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ and each one of those directories has a subdirectory latest_assembly_versions/, the contents of which I am looping through.