Question: Failed job after all-vs-all: Error, 'unable to ReadProgram' on SLURM cluster

While running OMA 1.1.2 on a SLURM cluster, I got an error message from the single job that runs after my parallelized all-vs-all jobs:

Error, 'unable to ReadProgram(Cache/AllAll/id01/id04/part_3032-3127)

The single bash job is still running on the cluster, but it has been almost two days and there has been no update in my job.out file.

Meanwhile, the job.err file shows:

rm: cannot remove ‘Cache/conversion.running’

Could you suggest what I should do? Should I let it run anyway?

Thank you

written 4 weeks ago by gusa_100

This post is lacking sufficient detail. Please include information about the program being used, the exact command line with options, and the kind of analysis being done, with the type of data.

written 4 weeks ago by genomax

Tagging: adrian.altenhoff

written 29 days ago by genomax

Hi Gusa

I'm one of the OMA maintainers. The version you are using is already somewhat outdated; the parallel processing of jobs has since been improved quite a bit. If possible, I suggest upgrading OMA to the latest version.

Most likely the referenced chunk is corrupted, something that could happen on slow filesystems with older versions of OMA. The best approach is to abort the run, remove this chunk, and restart the job. It should only need to redo this single chunk, so that should not take long, and then it should continue with the inference of the orthologs.

About the conversion.running problem: I assume that this file has already been removed by another job. If not, remove it prior to restarting oma.
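For illustration, a minimal sketch of that cleanup on a SLURM cluster (the job ID and submission script name are assumptions; the chunk path is taken from the error message above):

scancel 1234567                                # hypothetical ID of the stuck single job
rm -f Cache/AllAll/id01/id04/part_3032-3127    # corrupted chunk reported in the error
rm -f Cache/conversion.running                 # stale lock file, only if it is still present
sbatch run_oma_single.sh                       # resubmit; script name is an assumption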

Good luck with the run! Best wishes, Adrian

written 29 days ago by adrian.altenhoff

Hi Adrian,

Thank you very much. I am encountering new problems with the latest version of OMA. After getting several messages of this type in my SLURM job.out (except for the last line):

You specified to stop after the database conversion step (i.e. you set the "-c" flag). Database conversion successfully finished.

I got an error message on the job.err:

OMA.2.1.1/bin/../darwinlib/../data/GOdata.drw-20171023: 76.7% -- replaced with ../data/GOdata.drw-20171023

While the last line of the job.out says:

: waiting for too long. abort. It seems that your parallelisation ...

I started my job with the options: ..oma -n 20 -c

Any suggestions?

Thank you so much in advance!

written 28 days ago by gusa_100

Problem solved! I just needed more memory.
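For reference, a minimal SLURM submission sketch with a larger memory request (the --mem and --time values and the script layout are assumptions; only the oma -n 20 -c invocation comes from this thread):

#!/bin/bash
#SBATCH --job-name=oma-convert
#SBATCH --cpus-per-task=20    # assumes one CPU per process requested with -n 20
#SBATCH --mem=64G             # assumed value; raise this if the conversion step runs out of memory
#SBATCH --time=24:00:00       # assumed runtime limit

OMA.2.1.1/bin/oma -n 20 -c    # 20 parallel processes, stop after the database conversion step (-c)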

written 28 days ago by gusa_100

Hi Adrian,

I re-ran the analysis with OMA 2.1.1 and still get the same error message: "Error, 'unable to ReadProgram(Cache/AllAll/sp1/sp2/part_1042-1106)". Should I keep deleting these corrupted files and re-running?

Thank you!

written 19 days ago by gusa_100

Yes. It might be useful to check in the scheduler's log why they are failing (e.g. too little memory allocated to the process, or too little runtime reserved?). Cheers, Adrian
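For example, SLURM's accounting command can show whether a failed all-vs-all job hit its memory or time limit (the job ID is a placeholder):

sacct -j 1234567 --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
# A State of OUT_OF_MEMORY (or MaxRSS close to ReqMem) points to too little memory;
# TIMEOUT points to too little runtime reserved.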

written 15 days ago by adrian.altenhoff