9.6 years ago
caizexi123
Hi, guys!
I am running ABySS on a cluster. I compiled ABySS-1.5.2 against the OpenMPI-1.4.3 provided by the server, installing it in my own directory from one of the nodes.
Compilation completed, and ABYSS-P is present in the install directory.
Then I submitted my job via the LSF system:
APP_NAME=bioloong
NP=10
RUN="/share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in='/share/home/jinlab/data/test/coix_1.fastq'"
But the job was killed, with the following LSF output:
Sender: LSF System <lsfadmin1@blade115>
Subject: Job 481296: <testjob> Exited
Job <testjob> was submitted from host <bio-login2> by user <jinlab> in cluster <bcc_cloud1>.
Job was executed on host(s) <10*blade115>, in queue <bioloong>, as user <jinlab> in cluster <bcc_cloud1>.
</share/home/jinlab> was used as the home directory.
</share/home/jinlab/data/test> was used as the working directory.
Started at Sun Mar 22 17:03:51 2015
Results reported at Sun Mar 22 17:15:29 2015
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
testjob
------------------------------------------------------------
Exited with exit code 2.
Resource usage summary:
CPU time : 1.10 sec.
Max Memory : 65 MB
Max Swap : 603 MB
Max Processes : 36
Max Threads : 44
The output (if any) follows:
RUN=/share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in='/share/home/jinlab/data/test/coix_1.fastq'
/lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/mpirun.lsf -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
...
/lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/mpirun.lsf -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
--------------------------------------------------------------------------
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
[blade115:26042] [[7220,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
[blade115:26042] [[7220,1],1] could not get route to [[INVALID],INVALID]
[...
[blade115:26051] [[7220,1],9] could not get route to [[INVALID],INVALID]
[blade115:26051] [[7220,1],9] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 86
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Error 1
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Error 1
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Error 1
make: *** [coix-1.fa] Error 1
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
Mar 22 17:04:14 2015 25906 7 8.0.1 reportJRusage: failed to send rusage report to SBD
Mar 22 17:08:53 2015 25988 4 8.0.1 checkPJLStartup: PAM has received no feedback from any TaskStarter for 300 seconds after PJL has started. Shutting down the job ...
...
Mar 22 17:08:53 2015 25986 4 8.0.1 checkPJLStartup: PAM has received no feedback from any TaskStarter for 300 seconds after PJL has started. Shutting down the job ...
Mar 22 17:14:13 2015 25906 Last message repeated 574 time(s).
Mar 22 17:14:14 2015 25906 7 8.0.1 reportJRusage: failed to send rusage report to SBD
Mar 22 17:14:56 2015 25988 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Mar 22 17:14:59 2015 25992 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Mar 22 17:15:02 2015 25989 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Mar 22 17:15:06 2015 25987 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Mar 22 17:15:26 2015 25984 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Mar 22 17:15:26 2015 25986 3 8.0.1 PAM: waitForPJLExit: Timed out while waiting for PJL to exit. Sending SIGKILL
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper -np 10 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s coix-bubbles.fa -o coix-1.fa /share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00001 blade115 Undefined
...
00010 blade115 Undefined
make: *** [coix-1.fa] Killed
Job /lsf1/8.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper /share/home/jinlab/bin/bin/abyss-pe k=50 name=coix in=/share/home/jinlab/data/test/coix_1.fastq
TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00000 blade115 /share/home/jinl Exit (2) 03/22/2015 17:15:06
...
00009 blade115 /share/home/jinl Exit (2) 03/22/2015 17:15:2
Does anyone know what's wrong?
I'm afraid that those error messages don't mean much to me. They look pretty specific to LSF, which I haven't used. Sorry I couldn't be of more help.
I can't tell what the problem is either, unfortunately.
Have you successfully run MPI jobs on your cluster before? If not, I would suggest compiling and testing with a simple MPI "Hello, World!" program, such as the one provided here. That would let you know whether the problem is specific to ABySS or if it is a problem with the job submission parameters.