Question: Failed job after all-vs-all: "Error, 'unable to ReadProgram" on SLURM cluster
gusa_10 wrote (7 months ago):

While running OMA 1.1.2 on a SLURM cluster, I got an error message from the single job that runs after my parallelized all-vs-all jobs:

Error, 'unable to ReadProgram(Cache/AllAll/id01/id04/part_3032-3127)

The single bash job is still running on the cluster, but it has been almost two days and there has been no update in my job.out file.

Meanwhile, the job.err file shows:

rm: cannot remove ‘Cache/conversion.running’

Could you suggest what I should do? Should I let it keep running anyway?

Thank you

genomax replied (7 months ago):

This post lacks sufficient detail. Please include the program being used, the exact command line with options, and the kind of analysis being done along with the type of data.


genomax replied (7 months ago):

Tagging: adrian.altenhoff

adrian.altenhoff (Switzerland) wrote (7 months ago):

Hi Gusa

I'm one of the OMA maintainers. The version you are using is already a bit outdated; the parallel processing of jobs has been improved quite a bit since then. If possible, I suggest upgrading OMA to the latest version.

Most likely the chunk referred to in the error is corrupted, something that could happen on slow filesystems with older versions of OMA. The best approach is to abort the run, remove this chunk, and restart the job. It should only need to redo this single chunk, so it should not take long, and then it should continue with the inference of the orthologs.
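For illustration, a minimal sketch of that cleanup (the SLURM job ID below is a placeholder; the chunk path is the one from the error message, and the file on disk may carry a compression suffix, hence the trailing wildcard):

    # cancel the still-running job first (job ID is hypothetical)
    scancel 1234567
    # remove the corrupted all-vs-all chunk so OMA recomputes only this piece on the next run
    rm Cache/AllAll/id01/id04/part_3032-3127*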

About the conversion.running problem: I assume that this file has already been removed by another job. If not, remove it before restarting OMA.
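A sketch of the restart, assuming the run is driven by a submission script (the script name is hypothetical):

    # remove the stale lock file if it is still present
    rm -f Cache/conversion.running
    # resubmit the OMA job
    sbatch oma_job.sh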

Good luck with the run! Best wishes, Adrian


gusa_10 replied (7 months ago):

Hi Adrian,

Thank you very much. I am encountering new problems with the latest version of OMA. My SLURM job.out is filled with messages of this type (all except its last line):

You specified to stop after the database conversion step (i.e. you set the "-c" flag). Database conversion successfully finished.

I also got an error message in the job.err:

OMA.2.1.1/bin/../darwinlib/../data/GOdata.drw-20171023: 76.7% -- replaced wit ../data/GOdata.drw-20171023

While the last line of the job.out says:

: waiting for too long. abort. It seems that your parallelisation ...

I started my job with the options: ..oma -n 20 -c

Any suggestions?

Thank you so much in advance!


gusa_10 replied (7 months ago):

Problem solved! I just needed more memory.

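For reference, a minimal SLURM submission-script sketch for such a run; the memory, time, and paths are assumptions, and the --mem value is what one would raise when processes are killed for running out of memory:

    #!/bin/bash
    # minimal sketch of a submission script; memory, time, and paths are assumptions
    #SBATCH --job-name=oma
    # one CPU per worker started by "oma -n 20"
    #SBATCH --cpus-per-task=20
    # assumed value; raise it if processes are killed for lack of memory
    #SBATCH --mem=32G
    #SBATCH --time=48:00:00
    #SBATCH --output=job.out
    #SBATCH --error=job.err

    # hypothetical working directory containing DB/ and parameters.drw
    cd /path/to/oma/run
    # same options as in the original command
    ./OMA.2.1.1/bin/oma -n 20 -c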

andrespara replied (6 months ago):

Wrong thread, sorry.

gusa_10 wrote (7 months ago):

Hi Adrian,

I re-ran the analysis with OMA 2.1.1 and still get the same error message: "Error, 'unable to ReadProgram(Cache/AllAll/sp1/sp2/part_1042-1106)". Should I keep deleting these corrupted files and re-running?

Thank you!


adrian.altenhoff replied (7 months ago):

Yes. It might also be useful to check in the scheduler's log why they are failing (e.g. too little memory allocated to the process, or too little runtime reserved?). Cheers, Adrian
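One way to do that with SLURM's accounting, as a sketch (the job ID is a placeholder):

    # show state, exit code, runtime vs. time limit, and peak memory of the failed tasks
    sacct -j 1234567 --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem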
