Question

MaSuRCA genome assembly: k-unitig step not finishing

3.8 years ago

olivier_armant • 0

Dear all, I am trying to use MaSuRCA to assemble a large vertebrate genome (estimated 4 to 5 Gb) from Illumina PE reads (coverage near 100X), MP reads (coverage near 70X) and a few Nanopore long reads (coverage 1X). The run has been stuck for more than 15 days at the step that creates k-unitigs, and I would like to know whether this is normal. Should I kill this job and rerun it with less mate-pair data? I have posted this issue on GitHub but so far have had no response from the developers. I hope the community can help.

Here is the current log:
[Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads
[Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads
[Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175
[Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph
[Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33
[Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum
[Thu Jun 11 21:58:17 CEST 2020] Error correct PE
[Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31

=> The k-unitigs step (k=31) has been running for more than 15 days
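To put the stall in perspective, the time spent in each completed stage can be read off the log by differencing consecutive timestamps. The following is only a minimal sketch: the function name `stage_durations` and the regular expression are illustrative and not part of MaSuRCA, and the time zone label (CEST) is simply skipped on the assumption that it does not change within one run.

```python
import re
from datetime import datetime

# Matches lines such as "[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67".
# The time zone token is matched but ignored (assumed constant for the whole run).
LOG_LINE = re.compile(r"\[(\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \w+ (\d{4})\] (.*)")

def stage_durations(log_text):
    """Return (stage message, hours until the next log line) pairs.

    The last logged stage has no following line, so it is not reported.
    """
    entries = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            stamp = datetime.strptime(f"{m.group(1)} {m.group(2)}",
                                      "%a %b %d %H:%M:%S %Y")
            entries.append((stamp, m.group(3)))
    return [(msg, (nxt - t).total_seconds() / 3600.0)
            for (t, msg), (nxt, _) in zip(entries, entries[1:])]

# A few lines copied from the log above.
SAMPLE_LOG = """\
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31
"""

for message, hours in stage_durations(SAMPLE_LOG):
    print(f"{hours:6.2f} h  {message}")
```

On these lines the k=67 k-unitig pass took a little under 10 hours, which is the only comparison point the log offers for the k=31 pass that is still running.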

Here is the config file:
# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
#if single-end, do not specify <reverse_reads>
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= p1 350 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz
PE= p2 550 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz
PE= p3 350 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R2.fastq.gz
PE= p4 550 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R2.fastq.gz
#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
JUMP= m1 4000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz
JUMP= m2 3000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz
JUMP= m3 2500 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R2.fastq.gz
JUMP= m4 4000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz
JUMP= m5 3000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz
JUMP= m6 2500 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
#PACBIO=/FULL_PATH/pacbio.fa
NANOPORE=/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Nanopore_nottrimmed.fastq.gz
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
#REFERENCE=/FULL_PATH/nanopore.fa
END

PARAMETERS
#PLEASE READ all comments to essential parameters below, and set the parameters according to your project
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 1
#specifies whether to run the assembly on the grid
USE_GRID=0
#specifies grid engine to use SGE or SLURM
GRID_ENGINE=SGE
#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=500000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
#can increase this to 30 or 35 if your reads are short (N50<7000bp)
LHE_COVERAGE=25
#set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1
MEGA_READS_ONE_PASS=0
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina  or long read data
CLOSE_GAPS=1
#number of cpus to use, set this to the number of CPUs/threads per node you will be using
NUM_THREADS = 96
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20
JF_SIZE = 60000000000
#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.
#Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
#If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).
#Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.
#A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.
#Works well even when MEGA_READS_ONE_PASS is set to 1.
#DO NOT use if you have less than 15x coverage by long reads.
FLYE_ASSEMBLY=0
END
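One thing that can be checked directly from the numbers in this post is the JF_SIZE rule of thumb quoted in the config comment (estimated_genome_size*20). The sketch below only does that arithmetic with the genome size reported in the log and the JF_SIZE set in the config. The helper name `suggested_jf_size` is made up for illustration, and nothing in the log shows that this setting is what causes the stall.

```python
def suggested_jf_size(estimated_genome_size, factor=20):
    """Apply the config comment's rule of thumb: genome size times 20."""
    return estimated_genome_size * factor

estimated = 5938548948    # "Estimated genome size" from the log
configured = 60000000000  # JF_SIZE from the config file

suggested = suggested_jf_size(estimated)
print(f"suggested JF_SIZE : {suggested}")
print(f"configured JF_SIZE: {configured}")
print(f"configured / estimated genome size: {configured / estimated:.1f}x")
```

With the logged estimate the configured value works out to roughly 10 times the genome size, about half of what the comment calls a safe value.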

assembly genome • 942 views


score 0 · Answer 1 · 2020-08-03

Dear all, here is an update on my experience with MaSuRCA. Following the problem described in my question, I chose to use only part of the paired-end and mate-pair data. I have both 100 bp and 250 bp reads for the MP and PE libraries. Using only the 250 bp MP and PE libraries seems to work, but when I try to incorporate the 100 bp data, it fails. I thought MaSuRCA could use a mixture of PE and MP libraries with different read lengths, but in my case this does not seem to work. This is a pity, because it means that I can only use a fraction of my data…

Any help would be appreciated :)

Best, Olivier
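When splitting libraries by read length like this, it may help to confirm which files actually hold 100 bp reads and which hold 250 bp reads. The following is a rough sketch, not a MaSuRCA feature: `sample_read_lengths` is a hypothetical helper, and it assumes standard four-line FASTQ records (no wrapped sequence lines) in a gzipped file.

```python
import gzip
from collections import Counter
from itertools import islice

def sample_read_lengths(fastq_gz_path, max_reads=10000):
    """Count read lengths over the first max_reads records of a gzipped FASTQ.

    Assumes four lines per record, with the sequence on the second line.
    """
    lengths = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * max_reads)):
            if i % 4 == 1:  # sequence line of each record
                lengths[len(line.rstrip("\n"))] += 1
    return lengths

# Example call with one of the paths from the config above:
# print(sample_read_lengths(
#     "/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R1.fastq.gz"
# ).most_common(5))
```

Only the start of each file is sampled, so the counts describe the first reads and not necessarily the whole library.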