Error while running ABySS on a cluster
9.1 years ago
caizexi123 ▴ 60

Hi, guys!

I am running ABySS on a cluster. I compiled ABySS 1.5.2 with the OpenMPI 1.4.3 that the server provides, in my own directory on one of the nodes.

The compilation completed, and ABYSS-P is present in the install directory.

Then I submitted my job through the LSF system:

APP_NAME=bioloong
NP=10
RUN="/share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in='/share/home/jinlab/data/test/coix_1.fastq'"

But the job was killed, with the LSF output:

Sender: LSF System <lsfadmin1@blade115>
Subject: Job 481296: <testjob> Exited

Job <testjob> was submitted from host <bio-login2> by user <jinlab> in cluster <bcc_cloud1>.
Job was executed on host(s) <10*blade115>, in queue <bioloong>, as user <jinlab> in cluster <bcc_cloud1>.
</share/home/jinlab> was used as the home directory.
</share/home/jinlab/data/test> was used as the working directory.
Started at Sun Mar 22 17:03:51 2015
Results reported at Sun Mar 22 17:15:29 2015

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
testjob
------------------------------------------------------------

Exited with exit code 2.

Resource usage summary:

    CPU time   :      1.10 sec.
    Max Memory :        65 MB
    Max Swap   :       603 MB

    Max Processes  :        36
    Max Threads    :        44

The output (if any) follows:

RUN=/share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in='/share/home/jinlab/data/test/coix_1.fastq'
/lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/mpirun.lsf -np 10 ABYSS-P -k50 -q3   --coverage-hist=coverage.hist -s coix-bubbles.fa  -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq 
...
/lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/mpirun.lsf -np 10 ABYSS-P -k50 -q3   --coverage-hist=coverage.hist -s coix-bubbles.fa  -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq 
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
[blade115:26042] [[7220,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
[blade115:26042] [[7220,1],1] could not get route to [[INVALID],INVALID]
[...
[blade115:26051] [[7220,1],9] could not get route to [[INVALID],INVALID]
[blade115:26051] [[7220,1],9] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 86
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...
00010 blade115                     Undefined               
make: *** [coix-1.fa] Error 1
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...       
00010 blade115                     Undefined               
make: *** [coix-1.fa] Error 1
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...       
00010 blade115                     Undefined               
make: *** [coix-1.fa] Error 1
make: *** [coix-1.fa] Error 1
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...          
00010 blade115                     Undefined               
Mar 22 17:04:14 2015 25906 7 8.0.1 reportJRusage: failed to send rusage report to SBD
Mar 22 17:08:53 2015 25988 4 8.0.1 checkPJLStartup: PAM has received no feedback from any TaskStarter for 300 seconds after PJL has started. Shutting down the job ...
...
Mar 22 17:08:53 2015 25986 4 8.0.1 checkPJLStartup: PAM has received no feedback from any TaskStarter for 300 seconds after PJL has started. Shutting down the job ...
Mar 22 17:14:13 2015 25906 Last message repeated 574 time(s).
Mar 22 17:14:14 2015 25906 7 8.0.1 reportJRusage: failed to send rusage report to SBD
Mar 22 17:14:56 2015 25988 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...   
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Mar 22 17:14:59 2015 25992 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...      
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Mar 22 17:15:02 2015 25989 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Mar 22 17:15:06 2015 25987 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Mar 22 17:15:26 2015 25984 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Mar 22 17:15:26 2015 25986 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00001 blade115                     Undefined               
...
00010 blade115                     Undefined               
make: *** [coix-1.fa] Killed
Job  /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper /share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in=/share/home/jinlab/data/test/coix_1.fastq

TID   HOST_NAME   COMMAND_LINE            STATUS            TERMINATION_TIME
===== ========== ================  =======================  ===================
00000 blade115   /share/home/jinl  Exit (2)                 03/22/2015 17:15:06
...
00009 blade115   /share/home/jinl  Exit (2)                 03/22/2015 17:15:2

Does anyone know what's wrong?

abyss assembly • 3.3k views

I'm afraid that those error messages don't mean much to me. They look pretty specific to LSF, which I haven't used. Sorry I couldn't be of more help.


I can't tell what the problem is either, unfortunately.

Have you successfully run MPI jobs on your cluster before? If not, I would suggest compiling and testing with a simple MPI "Hello, World!" program, such as the one provided here. That would let you know whether the problem is specific to ABySS or if it is a problem with the job submission parameters.
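For reference, a test of that kind can be very small. The suggestion above is for a compiled program; the version below is only a minimal sketch in Python, assuming the `mpi4py` package is installed and built against the cluster's OpenMPI (nothing in this thread confirms that it is). The helper name `format_greeting` is made up for illustration.

```python
"""Minimal MPI "Hello, World!" sketch (assumes mpi4py is installed)."""


def format_greeting(rank, size, host):
    """Build the line that each MPI rank prints."""
    return f"Hello, World! I am rank {rank} of {size} on {host}"


def main():
    try:
        # Assumption: mpi4py is available and linked against the cluster's MPI.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; cannot run the MPI test")
        return

    comm = MPI.COMM_WORLD
    # Every rank reports its own rank, the total rank count, and its host name.
    print(format_greeting(comm.Get_rank(), comm.Get_size(),
                          MPI.Get_processor_name()))


if __name__ == "__main__":
    main()
```

Submitted the same way as the failing job (for example `mpirun -np 10 python hello_mpi.py`, where `hello_mpi.py` is a hypothetical file name), a working MPI setup should print one line per rank, with ranks 0 through 9. If this small job fails in the same way, the problem is probably in the job submission setup rather than in ABySS.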
