Question: slurm script for canu assembler
10 months ago, bio_d wrote:

Hi, I am trying to use the Canu assembler on my raw PacBio data. However, Canu fails to run any of its three steps (correcting reads, trimming reads, and assembling the corrected and trimmed reads): it exits even before detecting the available resources, which it is supposed to auto-detect. I can't figure out the reason. Please help.

Best, bio_d

Std error: (snippet of last few relevant lines)

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/bin/java').
srun: error: node01: task 1: Exited with exit code 1
srun: error: node02: task 0: Exited with exit code 1
srun: error: node05: task 2: Exited with exit code 1
srun: error: node04: task 0: Exited with exit code 1

The Slurm script used for submission:

#!/bin/bash
#SBATCH -N4
#SBATCH --job-name=assemble
#SBATCH --workdir=./
#SBATCH -o ./assemble_out.txt 
#SBATCH -e ./assemble_err.txt 

start_time=$(date +%s)
echo "Initial time " $start_time


# Correct input data
echo -e "\n Correct raw PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true -pacbio-raw /home/user1/PACBIO/*.fastq.gz

# Trim input data
echo -e "\n Trim corrected PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -trim -p test -d test_pacbio genomeSize=2.81g useGrid=false -pacbio-corrected /home/user1/PACBIO/test_pacbio/test.correctedReads.fasta.gz

# Assemble data
echo -e "\n Assemble trimmed and corrected PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -assemble -p test -d test_pacbio_assembly1 genomeSize=2.81g useGrid=false -pacbio-corrected /home/user1/PACBIO/test_pacbio/test.trimmedReads.fasta.gz

stop_time=$(date +%s)
echo "Final time " $stop_time

execution_time=$(expr $stop_time - $start_time)
echo -e "Execution time " $execution_time " seconds \n "
echo -e "\t \t" $(($execution_time/60)) " minutes \n"
echo -e "\t \t" $(($execution_time/3600)) " hours \n"
canu denovo usegrid • 835 views
modified 10 months ago • written 10 months ago by bio_d (0)

The error output does not show much. Can you add more of the error message?

written 10 months ago by Medhat (7.7k)

What else is there in assemble_err.txt?

You are not asking for a specific wall time or amount of memory. These may need to be set on your cluster (every cluster has different default values). You are also asking for 4 full nodes without specifying how many cores to assign. Does your cluster allow you to run without that specification? You may need to pass some of these options on the canu command line with the gridOptions= parameter.
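A minimal sketch of how scheduler options might be handed to canu through gridOptions=. The partition name and wall time are hypothetical placeholders, not values from this cluster, and the script only prints the command (a dry run) rather than launching canu:

```shell
#!/bin/bash
# Hypothetical values: replace with a partition and wall time valid on your cluster.
partition="name_of_SLURM_partition"
walltime="7-00:00:00"   # Slurm format: days-hours:minutes:seconds

# Options canu should forward to Slurm for every job it submits.
grid_options="--partition=${partition} --time=${walltime}"

# Dry run: build and print the command instead of executing it.
canu_cmd="canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true gridOptions=\"${grid_options}\" -pacbio-raw /home/user1/PACBIO/*.fastq.gz"
echo "$canu_cmd"
```

Once the printed command looks right, it can be pasted into a terminal or the batch script.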

written 10 months ago by genomax (55k)

Thank you guys for the suggestions. The default wall time on this system, if not specified, is 365 days for registered users (which I am), so that is not a problem. The #SBATCH options are optional: I have run jobs on this cluster both with specific values set and with none specified (the default setup). As far as my system administrator has told me, the only mandatory option is the number of nodes requested.

I thought of running only the correction step on a single node to test canu, and it halted with the error given below.

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true -pacbio-raw /home/user1/PACBIO/*.fastq.gz

Can you guys help me set appropriate memory usage for merylMemory? Thanks in advance.

The assemble_err.txt is below:

-- Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction, consensus and GFA building use:
-- Šošic M, Šikic M.
-- Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
-- Bioinformatics. 2017 May 1;33(9):1394-1395.
-- http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
-- Berlin K, et al.
-- Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
-- Nat Biotechnol. 2015 Jun;33(6):623-30.
-- http://doi.org/10.1038/nbt.3238
--
-- Myers EW, et al.
-- A Whole-Genome Assembly of Drosophila.
-- Science. 2000 Mar 24;287(5461):2196-204.
-- http://doi.org/10.1126/science.287.5461.2196
--
-- Li H.
-- Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
-- Bioinformatics. 2016 Jul 15;32(14):2103-10.
-- http://doi.org/10.1093/bioinformatics/btw152
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
-- Chin CS, et al.
-- Phased diploid genome assembly with single-molecule real-time sequencing.
-- Nat Methods. 2016 Dec;13(12):1050-1054.
-- http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
-- Chin CS, et al.
-- Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
-- Nat Methods. 2013 Jun;10(6):563-9
-- http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/bin/java').
-- Detected 8 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/16.05.8/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 10000 jobs.
--
-- Found 32 hosts with 8 cores and 62 GB memory under Slurm control.
-- Found 17 hosts with 1 core and 0 GB memory under Slurm control.
--
--
-- ERROR
-- ERROR
-- ERROR Found 2 machine configurations:
-- ERROR class0 - 32 machines with 8 cores with 62.79296875 GB memory each.
-- ERROR class1 - 17 machines with 1 cores with 0.0009765625 GB memory each.
-- ERROR
-- ERROR Task meryl can't run on any available machines.
-- ERROR It is requesting:
-- ERROR merylMemory=64-256 memory (gigabytes)
-- ERROR merylThreads=1-32 threads
-- ERROR
-- ERROR No available machine configuration can run this task.
-- ERROR
-- ERROR Possible solutions:
-- ERROR Change merylMemory and/or merylThreads
-- ERROR
ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: task meryl failed to find a configuration to run on.
ABORT:
srun: error: node04: task 0: Exited with exit code 1

written 10 months ago by bio_d (0)

It is clear from the log that:

-- ERROR No available machine configuration can run this task.

because what was detected on your grid is

class0 - 32 machines with 8 cores with 62.79296875 GB memory each.

class1 - 17 machines with 1 cores with 0.0009765625 GB memory each.

As you can see, the maximum memory per node is about 63 GB, while meryl asks for at least 64 GB.

Maybe you can try setting this in the batch file:

#SBATCH --mem=60000

which requests 60,000 MB (about 60 GB) per node, close to what the nodes in your cluster have.
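The error message itself lists another possible fix: "Change merylMemory and/or merylThreads". A hedged sketch of that route follows. The node figures come from the log; the value 60 is an assumption chosen to leave a little headroom below the 62 GB nodes, and whether meryl finishes within that much memory for a 2.81g genome is not something the log can confirm. The script only prints the command:

```shell
#!/bin/bash
# Figures read from the canu log above.
node_mem_gb=62     # "Found 32 hosts with 8 cores and 62 GB memory"
node_cores=8
meryl_min_gb=64    # lower bound of "merylMemory=64-256"

# Reproduce the check that made canu abort.
if [ "$meryl_min_gb" -gt "$node_mem_gb" ]; then
    echo "meryl asks for at least ${meryl_min_gb} GB, but the largest node has ${node_mem_gb} GB"
fi

# Assumption: cap meryl just below the node size and give it all cores of one node.
meryl_mem_gb=60
canu_cmd="canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true merylMemory=${meryl_mem_gb} merylThreads=${node_cores} -pacbio-raw /home/user1/PACBIO/*.fastq.gz"

# Dry run: print the command instead of executing it.
echo "$canu_cmd"
```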

modified 10 months ago • written 10 months ago by Medhat (7.7k)

Thank you. The canu job is running. I had to submit it from the head node, and miraculously canu started submitting jobs to the compute nodes.

However, while the job is indeed running, I am a bit confused: as input I have kept just the subreads.fastq.gz files. Should I also use the scraps.fastq.gz files? I didn't use them because I took the name literally and discarded them, but when I ran zcat scraps.fastq.gz | head I found that they too seem to contain read sequences.

So should I include the scraps*.fastq.gz files in the INPUT PACBIO RAW DATA directory?
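Beyond peeking with head, counting reads and bases gives a quick way to compare such files. A small sketch, using a made-up two-read FASTQ file as a stand-in for the real data so that it runs on its own:

```shell
#!/bin/bash
# Build a tiny gzipped FASTQ (two made-up reads) in a temporary directory.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCAA\n+\nIIIIII\n' | gzip > "$tmpdir/example.fastq.gz"

# Peek at the first record (4 lines), as done with zcat | head in the post.
gzip -dc "$tmpdir/example.fastq.gz" | head -n 4

# A FASTQ record is 4 lines; the sequence is the 2nd line of each record.
read_count=$(gzip -dc "$tmpdir/example.fastq.gz" | awk 'END { print NR / 4 }')
base_count=$(gzip -dc "$tmpdir/example.fastq.gz" | awk 'NR % 4 == 2 { n += length($0) } END { print n }')
echo "reads=$read_count bases=$base_count"

rm -rf "$tmpdir"
```

To use it on real data, point the two counting commands at the subreads or scraps file instead of the temporary example.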

modified 10 months ago • written 10 months ago by bio_d (0)

Happy that this fixed your issue. What is this scraps.fastq.gz file, and where did you get it from?

written 10 months ago by Medhat (7.7k)

See this: http://seqanswers.com/forums/showpost.php?p=202171&postcount=9

written 10 months ago by genomax (55k)

It is one of the Pacbio output files. You can have a look at this link http://www.pacb.com/wp-content/uploads/SMRT-Link-Getting-Started-Guide-v4.0.0.pdf (page 25).

written 10 months ago by bio_d (0)

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

You should also try specifying a specific partition (queue equivalent) that you have access to where those 4 nodes can be found.

#SBATCH -p name_of_SLURM_partition
written 10 months ago by genomax (55k)

I am sorry that I didn't use the ADD REPLY option. I will keep that in mind in the future. Thank you for your help.

modified 10 months ago • written 10 months ago by bio_d (0)

You are fine not including the scraps file. Take a look at this answer from Dr. Richard Hall over on SeqAnswers. He works at PacBio.

modified 10 months ago • written 10 months ago by genomax (55k)

Thank you very much.

written 10 months ago by bio_d (0)