I'm trying to call rsync within a Python loop to get files from NCBI.
After reading the man page on filtering rules and looking at http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files,
I don't understand why the code below doesn't work.
import os
import subprocess
from ftplib import FTP

ftp_site = 'ftp.ncbi.nlm.nih.gov'
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('genomes/genbank/bacteria')
dirs = ftp.nlst()  # one entry per organism
for organism in dirs:
    latest = os.path.join(organism, "latest_assembly_versions")
    for path in ftp.nlst(latest):
        # the last path component is the assembly accession
        accession = path.split("/")[-1]
        fasta = accession+"_genomic.fna.gz"
        subprocess.call(['rsync',
                         '--recursive',
                         '--copy-links',
                         #'--dry-run',
                         '-f=+ '+accession+'/*',
                         '-f=+ '+fasta,
                         '-f=- *',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest,
                         '--log-file=scratch/test_dir/log.txt',
                         'scratch/' + organism])
I also tried '--exclude=*[^'+fasta+']' instead of '-f=- *', to try to exclude files that don't match fasta.
For each directory path within latest/*, I want the file that matches fasta exactly. There will always be exactly one file fasta in the directory latest/path.
EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.
Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run", which won't download anything onto your machine: http://pastebin.com/0reVKMCg
It gets EVERYTHING under 'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want. And if I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: http://stackoverflow.com/questions/35364075/using-rsync-filter-to-include-exclude-files
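One reading of the man page that would be consistent with "it doesn't get anything": rules are tested in order, the first match decides, and an excluded directory is never descended into, so a catch-all '- *' can shut out a directory before the fasta rule is ever consulted. The sketch below is a simplified Python model of that idea and not rsync's actual matcher. The names simulate_filter and first_match, the toy tree, and the accession string are all made up for illustration, and it only handles slash-free patterns plus the directory-only pattern '*/'.

```python
from fnmatch import fnmatchcase


def first_match(name, is_dir, rules):
    """Return '+' or '-' from the first rule that matches; '+' if none do."""
    for action, pattern in rules:
        if pattern.endswith("/"):
            # A trailing slash restricts the pattern to directories.
            if is_dir and fnmatchcase(name, pattern[:-1]):
                return action
        elif fnmatchcase(name, pattern):
            return action
    return "+"


def simulate_filter(tree, rules, prefix=""):
    """Walk a nested dict top-down and list the files that survive the rules.

    Directories are dicts; files are None.
    """
    kept = []
    for name, child in sorted(tree.items()):
        is_dir = isinstance(child, dict)
        if first_match(name, is_dir, rules) == "-":
            continue  # excluded; for a directory, nothing below is visited
        path = prefix + name
        if is_dir:
            kept.extend(simulate_filter(child, rules, path + "/"))
        else:
            kept.append(path)
    return kept


# Hypothetical layout: one accession directory holding several files.
ACC = "GCA_000000000.1_example"
FASTA = ACC + "_genomic.fna.gz"
TREE = {ACC: {FASTA: None,
              ACC + "_genomic.gbff.gz": None,
              "md5checksums.txt": None}}

# Only the fasta rule and the catch-all exclude.
without_dirs = simulate_filter(TREE, [("+", FASTA), ("-", "*")])
# Same rules, but directories are included first.
with_dirs = simulate_filter(TREE, [("+", "*/"), ("+", FASTA), ("-", "*")])

print(without_dirs)
print(with_dirs)
```

In this toy model the first rule set keeps nothing, because the accession directory itself falls through to '- *', while the second keeps only the fasta file. Whether the real transfer behaves the same way would still need to be checked with --dry-run.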
In my script above, the variable dirs holds a list of all the organisms you will see at ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ and each one of those directories has a subdirectory latest_assembly_versions/, the contents of which I am looping through.
I think you are falling into the trap of the XY problem. Please tell us more about what you are trying to do. Even without the details of what you really want to do, I doubt you need to get Python involved here; most probably rsync by itself can do it.
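For what it is worth, a common rsync idiom for "only files with one particular name" is to include every directory, include the wanted name, and exclude everything else. A minimal sketch of the argument list under that assumption follows. build_rsync_args is a hypothetical helper, the source and destination layout simply mirror the question, the trailing slash on the source and --prune-empty-dirs are additions not present in the original script, and none of this has been checked against the NCBI server.

```python
def build_rsync_args(organism, accession, dry_run=True):
    """Build an rsync argument list that keeps only <accession>_genomic.fna.gz.

    Hypothetical helper for illustration; it runs nothing itself.
    """
    fasta = accession + "_genomic.fna.gz"
    # Trailing slash (an assumption): transfer the directory's contents,
    # so the filter rules are applied to the accession directories.
    source = ("ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/"
              + organism + "/latest_assembly_versions/")
    args = ["rsync", "--recursive", "--copy-links", "--prune-empty-dirs"]
    if dry_run:
        args.append("--dry-run")
    args += [
        "--include=*/",        # let rsync descend into every directory
        "--include=" + fasta,  # keep the one wanted file
        "--exclude=*",         # drop everything else
        source,
        "scratch/" + organism,
    ]
    return args


if __name__ == "__main__":
    # Placeholder names; prints the command instead of running it.
    print(" ".join(build_rsync_args("Some_organism", "GCA_000000000.1_example")))
```

The list can be handed to subprocess.call as in the question. Because no shell is involved, the '*' characters need no quoting.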