OMA core dumped error in a single node/machine
2.9 years ago
andrespara ▴ 10

Dear all,

I am having problems running OMA. I am running it on a single machine/node to avoid errors or conflicts with Open Grid Engine. Even when I run OMA -c first and then OMA -n 9, the number of running processes on the node drops to about half within a few minutes, and now only a single process is running.

At first it runs like this, with this error at the end:

Starting database conversion and checks...
Starting database conversion and checks...
NR_PROCESSES := 9
THIS_PROC_NR := 5
NR_PROCESSES := 9
THIS_PROC_NR := 3
NR_PROCESSES := 9
THIS_PROC_NR := 2
NR_PROCESSES := 9
THIS_PROC_NR := 6
NR_PROCESSES := 9
THIS_PROC_NR := 8
Starting database conversion and checks...
Starting database conversion and checks...
Starting database conversion and checks...
NR_PROCESSES := 9
Starting database conversion and checks...
THIS_PROC_NR := 9
Starting database conversion and checks...
Starting database conversion and checks...
Process 9717 on ubuntu-node9: job nr 4 of 9
Process 9714 on ubuntu-node9: job nr 1 of 9
Process 9720 on ubuntu-node9: job nr 7 of 9
Process 9716 on ubuntu-node9: job nr 3 of 9
Process 9715 on ubuntu-node9: job nr 2 of 9
Process 9718 on ubuntu-node9: job nr 5 of 9
Process 9719 on ubuntu-node9: job nr 6 of 9
Process 9721 on ubuntu-node9: job nr 8 of 9
Process 9722 on ubuntu-node9: job nr 9 of 9
1547818214.356338 - 1 - [pid 9714]: Computing specie1 vs specie1 (Part 2 of 263) Mem: 0.158GB
1547818214.484520 - 1 - [pid 9718]: Computing specie1 vs specie1 (Part 5 of 263) Mem: 0.158GB
1547818214.553570 - 1 - [pid 9721]: Computing specie1 vs specie1 (Part 18 of 263) Mem: 0.158GB
1547818214.663331 - 1 - [pid 9717]: Computing specie1 vs specie1 (Part 6 of 263) Mem: 0.158GB
1547818214.983455 - 1 - [pid 9716]: Computing specie1 vs specie1 (Part 1 of 263) Mem: 0.158GB
1547818214.985786 - 1 - [pid 9719]: Computing specie1 vs specie1 (Part 13 of 263) Mem: 0.158GB
1547818215.223881 - 1 - [pid 9715]: Computing specie1 vs specie1 (Part 20 of 263) Mem: 0.158GB
1547818215.477820 - 1 - [pid 9720]: Computing specie1 vs specie1 (Part 15 of 263) Mem: 0.158GB
1547818215.498398 - 1 - [pid 9722]: Computing specie1 vs specie1 (Part 3 of 263) Mem: 0.158GB
1547818256.459699 - 1 - [pid 9719]:   5.00% complete, time left for this part=0.22h, 0.3% of AllAll done. Mem: 0.211GB
1547818266.933788 - 1 - [pid 9719]:   10.00% complete, time left for this part=0.13h, 0.3% of AllAll done. Mem: 0.211GB
lostorage=61001728, s=4611686018640287784, historage=287372032
Irrecoverable system error gc-5
running Logger

Then, after some parts were computed, these lines appeared:

lostorage=53690368, s=4611686018568851256, historage=516004576
Irrecoverable system error gc-5
running Align
/usr/local/bin/OMA/bin/OMA: line 236:  9714 Aborted                 (core dumped) $OMA_PATH/bin/omadarwin${darwin_flag}  <<EOF
${OneMachineParallelInfo}$drw_cmds
done;
EOF

/usr/local/bin/OMA/bin/OMA: line 236:  9719 Aborted                 (core dumped) $OMA_PATH/bin/omadarwin${darwin_flag}  <<EOF
${OneMachineParallelInfo}$drw_cmds
done;
EOF

How could I fix this? As I stated above, only one process is running now. Thanks

Could it be that the node has too little memory? I haven't seen this type of error before, and it is a bit odd that it works with a single process. Those processes do not even communicate with each other... Adrian

That particular node worked fine some months ago, but that was alongside other, more powerful nodes in a parallel run with Open Grid Engine. I thought maybe something had changed, or maybe the parallelization is failing here. (In the meantime I launched another test with only 3 individuals, and it also dropped to a single process in less than an hour.)

Is there some way to diagnose or confirm this? I used a small dataset (3 individuals) in case some unexpected collisions occurred. Thanks
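
One rough way to check the memory hypothesis from the output already posted is to pull the per-process memory figures and the aborted PIDs out of the log. The sketch below is only an illustration: the regular expressions are written against the line formats shown in the log above, and the function name `summarize` is made up for this example, not part of OMA.

```python
import re
from collections import defaultdict

# Progress lines look like:
#   1547818214.356338 - 1 - [pid 9714]: Computing ... Mem: 0.158GB
MEM_RE = re.compile(r"\[pid (\d+)\]:.*Mem: ([0-9.]+)GB")
# Shell abort lines look like:
#   /usr/local/bin/OMA/bin/OMA: line 236:  9714 Aborted   (core dumped) ...
ABORT_RE = re.compile(r":\s+(\d+) Aborted")


def summarize(lines):
    """Return (peak memory in GB per pid, set of pids reported as aborted)."""
    peak = defaultdict(float)
    aborted = set()
    for line in lines:
        m = MEM_RE.search(line)
        if m:
            pid, mem = int(m.group(1)), float(m.group(2))
            peak[pid] = max(peak[pid], mem)
        a = ABORT_RE.search(line)
        if a:
            aborted.add(int(a.group(1)))
    return dict(peak), aborted


# A few lines copied from the log in the original post.
sample = [
    "1547818214.356338 - 1 - [pid 9714]: Computing specie1 vs specie1 (Part 2 of 263) Mem: 0.158GB",
    "1547818214.985786 - 1 - [pid 9719]: Computing specie1 vs specie1 (Part 13 of 263) Mem: 0.158GB",
    "1547818256.459699 - 1 - [pid 9719]:   5.00% complete, time left for this part=0.22h, 0.3% of AllAll done. Mem: 0.211GB",
    "/usr/local/bin/OMA/bin/OMA: line 236:  9714 Aborted                 (core dumped) $OMA_PATH/bin/omadarwin${darwin_flag}  <<EOF",
]
peak, aborted = summarize(sample)
print("peak memory (GB) per pid:", peak)
print("aborted pids:", sorted(aborted))
print("sum of peaks (GB):", round(sum(peak.values()), 3))
```

If the summed peaks stay far below the node's physical memory, as the roughly 0.2 GB per process in the posted log suggests, plain memory exhaustion becomes less likely and a faulty node more plausible, though this alone proves neither.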

Dear andre,

Can you reproduce this error on a different compute node as well? The error does indeed indicate a very general memory problem. Did you, for example, reserve enough resources through the scheduler? If you can reproduce it, the best way forward would be to make your concrete dataset available to me (ideally as a complete tarball of your project data), along with some additional information on the hardware you use to run it (OS, version, memory, etc. of the compute node, and the parameters you used to start an interactive(?) session on that node).
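
For collecting the hardware details requested above, something like the following could be run on the affected node. It is a minimal sketch using only the Python standard library on Linux; the function names `read_meminfo` and `node_report` are invented for this example, and the `MemAvailable` field may be missing on older kernels, in which case it is reported as `None`.

```python
import os
import platform
import resource  # Unix only


def read_meminfo(path="/proc/meminfo"):
    """Return {field: value in kB} from a Linux meminfo file, or {} if unreadable."""
    info = {}
    try:
        with open(path) as fh:
            for line in fh:
                key, _, rest = line.partition(":")
                parts = rest.split()
                if parts and parts[0].isdigit():
                    info[key.strip()] = int(parts[0])
    except OSError:
        pass
    return info


def node_report():
    """Collect OS, CPU count, memory and the address-space limit of this session."""
    mem = read_meminfo()
    soft_as, _hard_as = resource.getrlimit(resource.RLIMIT_AS)
    return {
        "host": platform.node(),
        "os": platform.platform(),
        "cpus": os.cpu_count(),
        "mem_total_kb": mem.get("MemTotal"),
        "mem_available_kb": mem.get("MemAvailable"),
        # A finite limit here would cap every process started from this session.
        "addr_space_limit": "unlimited" if soft_as == resource.RLIM_INFINITY else soft_as,
    }


if __name__ == "__main__":
    for key, value in node_report().items():
        print(f"{key}: {value}")
```

Running it inside the same scheduler or interactive session that launches OMA matters, because resource limits are inherited from that session rather than being a property of the machine.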

Thanks Adrian, we will try to reproduce the error on other machines. By the way, those machines run Ubuntu 14.04, which might interfere with the analysis (I am pushing for an upgrade at the moment). In the meantime, another run on a desktop machine outside the cluster was interrupted by a blackout (bad luck), but it seems to have been working fine before the interruption, so everything suggests that the machine in the cluster is not working properly. Thanks for your help.

OK, thanks for the update. Ubuntu 14.04 should in principle work fine: we have tested OmaStandalone on this setup and have also used it for some analyses. Feel free to get back to me (also via the omabrowser contact email) if you get stuck. But I agree that it could simply be a hardware problem with that node.

Hi Andre,

I had another look at your original error and tried to figure out whether the value the structure points to could actually be some meaningful data, but it turns out that this is very unlikely. It's certainly not a string value and unlikely to be even a floating-point value. Let me know in case you get more insights... Adrian
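
For anyone who wants to look at the numbers in the two gc-5 messages themselves, the sketch below splits each reported `s` into its high bits and the remainder and compares the remainder with `lostorage` and `historage`. It is plain arithmetic on the values posted in the thread. Treating `lostorage`/`historage` as lower and upper bounds and `s` as an address-like value is an assumption made only for this illustration, and the helper name `inspect_report` is made up.

```python
# Values copied from the two "Irrecoverable system error gc-5" messages above.
reports = [
    {"lostorage": 61001728, "s": 4611686018640287784, "historage": 287372032},
    {"lostorage": 53690368, "s": 4611686018568851256, "historage": 516004576},
]


def inspect_report(r, bit=62):
    """Split s into the bits at or above `bit` and the remaining low part.

    Returns (high, low, in_range), where in_range says whether the low part
    lies between lostorage and historage (an assumed interpretation).
    """
    high = r["s"] >> bit
    low = r["s"] & ((1 << bit) - 1)
    in_range = r["lostorage"] <= low <= r["historage"]
    return high, low, in_range


for r in reports:
    high, low, in_range = inspect_report(r)
    print(f"s={r['s']}: high bits={high}, low part={low}, within lo/hi: {in_range}")
```

With the posted values, both `s` values have only bit 62 set above the low part, and the low part lies between `lostorage` and `historage` in each case. Whether that bit is something Darwin sets on purpose or a sign of corruption is not something this thread settles, so this is a pointer for further digging rather than a diagnosis.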

Thanks, we abandoned that node for the moment and moved the work to another machine (a desktop, in fact). If/when I schedule more work on that grid and its machines, I will watch for that error and check whether it appears only on that node or elsewhere too. Thanks!

