MaSuRCA genome k-unitig step not finishing
3.8 years ago

Dear all, I am trying to use MaSuRCA to assemble a large vertebrate genome (estimated 4 to 5 Gb) from Illumina PE (coverage near 100X), MP (coverage near 70X) and a few Nanopore long reads (coverage 1X). The step that creates k-unitigs has now been running for more than 15 days, and I would like to know whether this is normal. Should I kill this job and restart with less mate-pair data? I have posted this issue on GitHub but so far have had no response from the developers. I hope the community can help.


Here is the current log:
[Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads
[Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads
[Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175
[Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph
[Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33
[Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum
[Thu Jun 11 21:58:17 CEST 2020] Error correct PE
[Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31

=> the k-unitigs step (k=31) has been running for more than 15 days
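
To see how long each earlier step took, the timestamps in the log above can be turned into per-step durations. This is only a rough sketch: it assumes an English locale for the day and month names and ignores the timezone label, since every line uses CEST. The function names are made up for illustration and are not part of MaSuRCA.

```python
import re
from datetime import datetime

# Log lines copied verbatim from the post.
LOG = """\
[Thu Jun 11 13:47:49 CEST 2020] Processing pe library reads
[Thu Jun 11 17:20:33 CEST 2020] Processing sj library reads
[Thu Jun 11 19:14:48 CEST 2020] Average PE read length 175
[Thu Jun 11 19:14:49 CEST 2020] Using kmer size of 67 for the graph
[Thu Jun 11 19:14:49 CEST 2020] MIN_Q_CHAR: 33
[Thu Jun 11 19:14:50 CEST 2020] Creating mer database for Quorum
[Thu Jun 11 21:58:17 CEST 2020] Error correct PE
[Fri Jun 12 03:56:09 CEST 2020] Error correct JUMP
[Fri Jun 12 06:22:37 CEST 2020] Estimating genome size
[Fri Jun 12 08:52:17 CEST 2020] Estimated genome size: 5938548948
[Fri Jun 12 08:52:17 CEST 2020] Creating k-unitigs with k=67
[Fri Jun 12 18:36:45 CEST 2020] Creating k-unitigs with k=31
"""

# Captures "Thu Jun 11 13:47:49", skips the timezone word, then year and message.
LINE_RE = re.compile(r"^\[(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d) \w+ (\d{4})\] (.*)$")


def parse_log(text):
    """Return (timestamp, message) pairs; the timezone label is dropped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{match.group(1)} {match.group(2)}", "%a %b %d %H:%M:%S %Y"
            )
            entries.append((stamp, match.group(3)))
    return entries


def step_durations(entries):
    """Pair each message with the time elapsed until the next log line."""
    return [
        (msg, nxt - ts)
        for (ts, msg), (nxt, _) in zip(entries, entries[1:])
    ]


if __name__ == "__main__":
    for msg, delta in step_durations(parse_log(LOG)):
        print(f"{delta}  {msg}")
```

The last entry (k=31) has no following line, so it gets no duration here. For comparison, the k=67 pass took a little under 10 hours according to these timestamps.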


Here is the config file:
# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
#if single-end, do not specify <reverse_reads>
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= p1 350 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE_250bp.2.fq.gz
PE= p2 550 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE_250bp.2.fq.gz
PE= p3 350 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_350PE.R2.fastq.gz
PE= p4 550 150  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_550PE.R2.fastq.gz
#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
JUMP= m1 4000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb.R2.fastq.gz
JUMP= m2 3000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb.R2.fastq.gz
JUMP= m3 2500 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R1.fastq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree.R2.fastq.gz
JUMP= m4 4000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_5kb_250bp.2.fq.gz
JUMP= m5 3000 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_8kb_250bp.2.fq.gz
JUMP= m6 2500 500  /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.1.fq.gz /TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Sample_Y2_gelfree_250bp.2.fq.gz
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
#PACBIO=/FULL_PATH/pacbio.fa
NANOPORE=/TMPLOCAL/olivier_data/Seq_nottrimmed/seq/Nanopore_nottrimmed.fastq.gz
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
#REFERENCE=/FULL_PATH/nanopore.fa
END

PARAMETERS
#PLEASE READ all comments to essential parameters below, and set the parameters according to your project
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 1
#specifies whether to run the assembly on the grid
USE_GRID=0
#specifies grid engine to use SGE or SLURM
GRID_ENGINE=SGE
#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=500000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
#can increase this to 30 or 35 if your reads are short (N50<7000bp)
LHE_COVERAGE=25
#set to 0 (default) to do two passes of mega-reads for slower, but higher quality assembly, otherwise set to 1
MEGA_READS_ONE_PASS=0
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina  or long read data
CLOSE_GAPS=1
#number of cpus to use, set this to the number of CPUs/threads per node you will be using
NUM_THREADS = 96
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20
JF_SIZE = 60000000000
#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.
#Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
#If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).
#Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.
#A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.
#Works well even when MEGA_READS_ONE_PASS is set to 1.
#DO NOT use if you have less than 15x coverage by long reads.
FLYE_ASSEMBLY=0
END
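
Two of the comments in the config above state rules of thumb that can be checked against the values actually set: `JF_SIZE` should be about `estimated_genome_size*20`, and `USE_LINKING_MATES` should be 0 when mate pairs or more than 15x long reads are present. The sketch below only compares the posted numbers with those comments. The helper names are hypothetical, and whether either mismatch is related to the slow k-unitig step is not established here.

```python
# Values copied from the log and the config in this post.
ESTIMATED_GENOME_SIZE = 5_938_548_948  # "Estimated genome size" log line
CONFIGURED_JF_SIZE = 60_000_000_000    # JF_SIZE in the config
USE_LINKING_MATES = 1                  # as set in the config
NUM_JUMP_LIBRARIES = 6                 # JUMP entries m1..m6
NANOPORE_COVERAGE = 1                  # approximate, from the question text


def suggested_jf_size(genome_size, factor=20):
    """Rule of thumb from the config comment: estimated_genome_size * 20."""
    return genome_size * factor


def check_config():
    """Return notes where the posted values differ from the config comments."""
    notes = []
    target = suggested_jf_size(ESTIMATED_GENOME_SIZE)
    if CONFIGURED_JF_SIZE < target:
        notes.append(
            f"JF_SIZE={CONFIGURED_JF_SIZE} is below genome_size*20={target}"
        )
    # The comment says to use 0 with mate pairs or >15x long-read coverage.
    has_mates_or_long_reads = NUM_JUMP_LIBRARIES > 0 or NANOPORE_COVERAGE > 15
    if USE_LINKING_MATES == 1 and has_mates_or_long_reads:
        notes.append("USE_LINKING_MATES=1 although mate-pair libraries are present")
    return notes


if __name__ == "__main__":
    for note in check_config():
        print(note)
```

With the genome size estimated by MaSuRCA itself, the rule of thumb gives roughly 1.2e11, about twice the configured `JF_SIZE`.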
3.7 years ago

Dear all, here is an update on my experience using MaSuRCA. Following the problem above, I chose to use only part of the paired-end and mate-pair data. I have both 100 bp and 250 bp reads for the MP and the PE libraries. Using only the 250 bp MP and PE libraries seems to work, but when I try to incorporate the 100 bp data, it fails. I thought MaSuRCA could use a mixture of PE and MP libraries with different read lengths, but in my case this does not seem to work. This is a pity, because it means I can only use a fraction of my data…

Any help would be appreciated :)

Best, Olivier
