I'm trying to run OMA 2.2.0 on my university's HPC environment. They have a shared Gluster file system and have told me that they don't want users to run jobs that write directly to it. Instead, they want jobs to write their output to each compute node's local filesystem and copy it elsewhere at the end of the job. Their reasoning is that if all users wrote directly to the Gluster system, everyone's reads and writes would slow down. This means I can't run parallel OMA jobs on many CPUs that all point to the same directory. I have 70 genomes to run through OMA, so I want to use hundreds of CPUs if possible.
With a bit of tinkering, I was able to figure out that if I create zero-sized files in the Cache/AllAll/<genome1>/<genome2> directory, OMA will skip that particular genome1-genome2 comparison. I notice that the gz files have names like part_X-Y.gz. My question is, how does OMA determine how many parts (gz files) there should be in a particular directory? The number doesn't seem to be constant.
My plan is to start many different jobs with specific "gaps" in the AllAll directory structure, to coerce each job to do a particular part of the All-v-All stage. However, I'd like to have some way of knowing which filenames are expected ahead of time (i.e. the values of X and Y in part_X-Y.gz). After deleting the zero-sized files, the directories will be combined with rsync. Convoluted, I know, but those are the constraints I am under.
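For the cleanup step before the rsync, a minimal Python sketch of what I have in mind is below. The function name `remove_empty_placeholders` is just a hypothetical helper, and it assumes that every zero-sized file under the given directory is one of my placeholders (a real gzip output is never zero bytes, so genuine part files should be left alone).

```python
import os


def remove_empty_placeholders(root):
    """Delete zero-byte regular files under root; return the removed paths.

    Assumption: any zero-sized file below root is a placeholder that was
    created only to make a job skip that comparison.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Only zero-byte files are treated as placeholders.
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

I would run this on each job's local copy of Cache/AllAll before combining the copies, so that an empty placeholder from one job can never be copied over a real result from another.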
I would be interested to hear your thoughts,