Question

slurm script for canu assembler

7.7 years ago

bio_d ▴ 20

Hi, I am trying to use the Canu assembler on my raw PacBio data. However, Canu fails before any of its three steps (correcting reads, trimming reads, and assembling the corrected and trimmed reads): it exits even before detecting the available resources, which it is supposed to auto-detect. I can't figure out the reason. Please help.

Best, bio_d

Std error: (snippet of last few relevant lines)

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/bin/java').

srun: error: node01: task 1: Exited with exit code 1
srun: error: node02: task 0: Exited with exit code 1
srun: error: node05: task 2: Exited with exit code 1
srun: error: node04: task 0: Exited with exit code 1

The Slurm script used for submission:

#!/bin/bash
#SBATCH -N4
#SBATCH --job-name=assemble
#SBATCH --workdir=./
#SBATCH -o ./assemble_out.txt 
#SBATCH -e ./assemble_err.txt 

start_time=$(date +%s)
echo "Initial time " $start_time


# Correct input data
echo -e "\n Correct raw PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true -pacbio-raw /home/user1/PACBIO/*.fastq.gz

# Trim input data
echo -e "\n Trim corrected PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -trim -p test -d test_pacbio genomeSize=2.81g useGrid=false -pacbio-corrected /home/user1/PACBIO/test_pacbio/test.correctedReads.fasta.gz

# Assemble data
echo -e "\n Assemble trimmed and corrected PacBio data \n"

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -assemble -p test -d test_pacbio_assembly1 genomeSize=2.81g useGrid=false -pacbio-corrected /home/user1/PACBIO/test_pacbio/test.trimmedReads.fasta.gz

stop_time=$(date +%s)
echo "Final time " $stop_time

execution_time=$(expr $stop_time - $start_time)
echo -e "Execution time " $execution_time " seconds \n "
echo -e "\t \t" $(($execution_time/60)) " minutes \n"
echo -e "\t \t" $(($execution_time/3600)) " hours \n"

canu denovo useGrid • 6.2k views

ADD COMMENT • link 7.7 years ago by bio_d ▴ 20

The error output does not show much. Can you add more of the error message?

ADD REPLY • link 7.7 years ago by Medhat 9.8k

What else is there in assemble_err.txt?

You are not asking for a specific wall time or memory. These may need to be set on your cluster (every cluster has different default values). You are also asking for 4 full nodes without specifying how many cores to use. Does your cluster allow you to run without that specification? You may need to pass some of these options on the canu command line with the gridOptions= directive.

ADD REPLY • link 7.7 years ago by GenoMax 152k
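
As a rough illustration of the gridOptions= suggestion above, the sketch below builds an option string and prints the resulting canu command as a dry run. The partition name and wall time are placeholder assumptions, not values known for this cluster; --partition and --time are standard sbatch options, and whether they are needed at all depends on the site.

```shell
#!/bin/sh
# Hypothetical values: substitute a partition and wall time valid on your cluster.
PARTITION="name_of_SLURM_partition"
WALLTIME="7-00:00:00"   # days-hours:minutes:seconds

# Options that canu should forward to the jobs it submits.
GRID_OPTIONS="--partition=${PARTITION} --time=${WALLTIME}"

# Dry run: print the command for review; remove the leading "echo" to execute it.
echo /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu \
    -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true \
    "gridOptions=${GRID_OPTIONS}" \
    -pacbio-raw /home/user1/PACBIO/*.fastq.gz
```

The whole option string is passed as a single quoted argument so that canu receives it as one gridOptions value.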

Thank you guys for the suggestions. The default wall time, if not specified, is 365 days on this system (for registered users, which I am), so that is not a problem. The #SBATCH options are optional: I have run jobs on this cluster both with specific values set and with none specified (the default setup). According to my system administrator, the only mandatory option is the number of nodes requested.

I tried running only the correction step on a single node to test canu, and it halted with the error given below.

srun /home/user1/DE_NOVO_ASSEMBLY/canu-1.6/Linux-amd64/bin/canu -correct -p test -d test_pacbio genomeSize=2.81g useGrid=true -pacbio-raw /home/user1/PACBIO/*.fastq.gz

Can you guys help me set appropriate memory usage for merylMemory? Thanks in advance.

The assemble_err.txt is below:

-- Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction, consensus and GFA building use:
-- Šošic M, Šikic M.
-- Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
-- Bioinformatics. 2017 May 1;33(9):1394-1395.
-- http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
-- Berlin K, et al.
-- Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
-- Nat Biotechnol. 2015 Jun;33(6):623-30.
-- http://doi.org/10.1038/nbt.3238
--
-- Myers EW, et al.
-- A Whole-Genome Assembly of Drosophila.
-- Science. 2000 Mar 24;287(5461):2196-204.
-- http://doi.org/10.1126/science.287.5461.2196
--
-- Li H.
-- Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
-- Bioinformatics. 2016 Jul 15;32(14):2103-10.
-- http://doi.org/10.1093/bioinformatics/btw152
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
-- Chin CS, et al.
-- Phased diploid genome assembly with single-molecule real-time sequencing.
-- Nat Methods. 2016 Dec;13(12):1050-1054.
-- http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
-- Chin CS, et al.
-- Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
-- Nat Methods. 2013 Jun;10(6):563-9
-- http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre/bin/java').
-- Detected 8 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/16.05.8/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 10000 jobs.
--
-- Found 32 hosts with 8 cores and 62 GB memory under Slurm control.
-- Found 17 hosts with 1 core and 0 GB memory under Slurm control.
--
-- ERROR
-- ERROR Found 2 machine configurations:
-- ERROR class0 - 32 machines with 8 cores with 62.79296875 GB memory each.
-- ERROR class1 - 17 machines with 1 cores with 0.0009765625 GB memory each.
-- ERROR
-- ERROR Task meryl can't run on any available machines.
-- ERROR It is requesting:
-- ERROR merylMemory=64-256 memory (gigabytes)
-- ERROR merylThreads=1-32 threads
-- ERROR
-- ERROR No available machine configuration can run this task.
-- ERROR
-- ERROR Possible solutions:
-- ERROR Change merylMemory and/or merylThreads
-- ERROR
ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: task meryl failed to find a configuration to run on.
ABORT:
srun: error: node04: task 0: Exited with exit code 1

ADD REPLY • link 7.7 years ago by bio_d ▴ 20

It is clear from the log that:

-- ERROR No available machine configuration can run this task.

because what was detected on your grid is

class0 - 32 machines with 8 cores with 62.79296875 GB memory each.

class1 - 17 machines with 1 cores with 0.0009765625 GB memory each.

so the maximum memory per node is about 63 GB, as you can see, while meryl is requesting at least 64 GB.

Maybe you can try setting this in the batch file:

#SBATCH --mem=60000

which sets the job's memory limit to about 60 GB, within what the nodes in your cluster have.

ADD REPLY • link 7.7 years ago by Medhat 9.8k
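
The log itself suggests changing merylMemory and/or merylThreads. As a minimal sketch of that route, the helper below reads canu's "Found N hosts with ..." lines and proposes values that fit the largest host class. The function name suggest_meryl_opts and the 2 GB headroom default are illustrative assumptions, not part of canu.

```shell
# Read canu log text on stdin and print merylMemory/merylThreads values
# that fit the host class with the most memory.
# $1: headroom in GB to leave free on the node (an assumption; default 2).
suggest_meryl_opts() {
    awk -v headroom="${1:-2}" '
        /Found [0-9]+ hosts? with [0-9]+ cores? and [0-9]+ GB memory/ {
            # The number just before "core(s)" is the core count,
            # the number just before "GB" is the memory.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^cores?$/) cores = $(i - 1)
                if ($i == "GB")      mem   = $(i - 1)
            }
            if (mem + 0 > best_mem + 0) { best_mem = mem; best_cores = cores }
        }
        END {
            if (best_mem - headroom > 0)
                printf "merylMemory=%d merylThreads=%d\n", best_mem - headroom, best_cores
        }'
}

# Usage: suggest_meryl_opts < assemble_err.txt
# The printed options can then be appended to the canu command line.
```

Whether meryl can actually count k-mers for a 2.81g genome within that much memory is a separate question that this sketch does not answer.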

Thank you. The canu job is running. I had to launch it from the head node, and miraculously canu started submitting jobs to the compute nodes.

However, while the job is indeed running, I am a bit confused about the input files: I have kept just the subreads.fastq.gz files. Should I also use the scraps.fastq.gz files? I took the name literally and discarded them, but when I ran zcat scraps.fastq.gz | head, I found that they too seem to contain read sequences.

So should I include the scraps*.fastq.gz files in the INPUT PACBIO RAW DATA directory?

ADD REPLY • link 7.7 years ago by bio_d ▴ 20
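
On launching from the head node, as reported in the reply above: with useGrid=true, canu submits its own jobs, so it needs to start somewhere that is allowed to call sbatch. A small sketch of choosing the flag automatically follows. It assumes compute nodes on this cluster cannot submit jobs, which is common but site-specific; canu_grid_flag is a hypothetical helper name.

```shell
# Pick a useGrid setting based on where the script is running.
# Slurm sets SLURM_JOB_ID inside a job allocation; it is unset on a head/login node.
canu_grid_flag() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "useGrid=false"   # already inside a job: run on this node only
    else
        echo "useGrid=true"    # head node: let canu submit jobs itself
    fi
}

# Usage (from the head node, without srun or sbatch):
#   canu -correct -p test -d test_pacbio genomeSize=2.81g "$(canu_grid_flag)" \
#       -pacbio-raw /home/user1/PACBIO/*.fastq.gz
```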

Happy that this fixed your issue. What is this scraps.fastq.gz file, and where did you get it from?

ADD REPLY • link 7.7 years ago by Medhat 9.8k

See this: http://seqanswers.com/forums/showpost.php?p=202171&postcount=9

ADD REPLY • link 7.7 years ago by GenoMax 152k

It is one of the PacBio output files. You can have a look at this link: http://www.pacb.com/wp-content/uploads/SMRT-Link-Getting-Started-Guide-v4.0.0.pdf (page 25).

ADD REPLY • link 7.7 years ago by bio_d ▴ 20

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

You should also try specifying a specific partition (queue equivalent) that you have access to where those 4 nodes can be found.

#SBATCH -p name_of_SLURM_partition

ADD REPLY • link 7.7 years ago by GenoMax 152k
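
To find a partition whose nodes are large enough, one option is to filter sinfo output. The sketch below assumes input in the layout produced by sinfo -h -o "%P %D %c %m" (partition, node count, CPUs per node, memory per node in MB); partitions_with_mem is a hypothetical helper, and the 60000 MB threshold is only an example.

```shell
# Read lines of "partition node-count cpus-per-node memory-MB" on stdin and
# print the partitions whose nodes have at least $1 MB of memory.
partitions_with_mem() {
    awk -v min_mb="$1" '{
        part = $1; sub(/\*$/, "", part)   # "*" marks the default partition
        mem  = $4; sub(/\+$/, "", mem)    # "+" marks a minimum across mixed nodes
        if (mem + 0 >= min_mb + 0) print part
    }' | sort -u
}

# Usage on the cluster:
#   sinfo -h -o "%P %D %c %m" | partitions_with_mem 60000
```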

I am sorry that I didn't use the ADD REPLY option. I will keep that in mind in the future. Thank you for your help.

ADD REPLY • link 7.7 years ago by bio_d ▴ 20

You are fine not including the scraps file. Take a look at this answer from Dr. Richard Hall over on SeqAnswers. He works at PacBio.
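
For anyone who wants to compare the subreads and scraps files directly rather than eyeballing zcat output, a quick tally of reads and bases can help. This is a minimal sketch that assumes standard four-line FASTQ records; fastq_gz_stats is a hypothetical helper name.

```shell
# Print the number of reads and total bases in a gzipped FASTQ file.
# Assumes four lines per record, with the sequence on the second line.
fastq_gz_stats() {
    gzip -dc < "$1" | awk '
        NR % 4 == 2 { n++; bases += length($0) }
        END         { printf "%d reads, %d bases\n", n, bases }'
}

# Usage:
#   fastq_gz_stats subreads.fastq.gz
#   fastq_gz_stats scraps.fastq.gz
```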