I am attempting a de novo assembly of a reptile genome (comparable in size to the human genome). I have an Illumina paired-end library (~200 bp) and two Illumina mate-pair libraries (5.2 kb and 10 kb). In addition, I have a PacBio library (<10x coverage).
I am a little confused: ALLPATHS-LG ships with several scripts, executables, and modules, such as CleanCorrectedReads, CorrectLongReads, ErrorCorrectJump, ErrorCorrectReads.pl, and EvaluateCorrectedPairs. Should one therefore provide all the libraries as is (i.e., as obtained from the sequencing center) and let ALLPATHS-LG use its own error-correction machinery to correct them?
Nonetheless, I trimmed the Illumina libraries by quality and adapter content, and corrected the PacBio reads with LoRDEC (using all of the short-read libraries).
However, when I use these corrected and trimmed sequences in the PrepareAllPathsInputs.pl step of ALLPATHS-LG, only the short reads are added to the database (although the table in the output log notes that the PacBio library will be added). I can see the SUB_DIR (which is "data" in my case) populated with the following files:
frag_reads_orig.fastb, frag_reads_orig.qualb, frag_reads_orig.pairs
jump_reads_orig.fastb, jump_reads_orig.qualb, jump_reads_orig.pairs
However, the corresponding files for the PacBio library (long_jump_reads_orig.fastb, long_jump_reads_orig.qualb, long_jump_reads_orig.pairs) are absent. At first, I supplied the LoRDEC-corrected PacBio sequences as FASTA files, and ALLPATHS-LG complained that there were no quality scores (rightly so, because FASTA carries no quality information). I then used a fasta_to_fastq Perl script to convert the corrected PacBio reads to FASTQ format and reran the PrepareAllPathsInputs.pl step, but to my surprise the output log for that step shows:
==================== WARNINGS ====================
!!!! No 'long_jump' cached read groups found. Long jumping reads (typically 40 kb, < 1x coverage) are useful only for scaffolding of vertebrate size genomes, and are not required for an assembly.
!!!! No 'long' cached read groups found. Long reads (typically, unpaired, 1 kb, 50x coverage) are useful only for gap patching of relatively small genomes, and are not required for an assembly.
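For context, the FASTA-to-FASTQ conversion mentioned above amounts to roughly the following. This is a minimal Python sketch, not the Perl script I actually ran; the function names and the placeholder quality character "I" (Phred+33, Q40) are assumptions made for illustration, since FASTA carries no real quality values.

```python
PLACEHOLDER_QUAL = "I"  # assumption: Phred+33 Q40 assigned to every base


def read_fasta(handle):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(in_handle, out_handle, qual_char=PLACEHOLDER_QUAL):
    """Write each FASTA record as a four-line FASTQ record; return the count."""
    count = 0
    for header, seq in read_fasta(in_handle):
        # The quality string must have exactly one character per base.
        out_handle.write(f"@{header}\n{seq}\n+\n{qual_char * len(seq)}\n")
        count += 1
    return count
```

The qualities produced this way are artificial, so any downstream step that weighs base quality would be treating every PacBio base as equally reliable.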
My in_groups.csv file looks like this:
group_name, library_name, file_name
paired-end1, frag_lib1, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib1/*.fq.gz
paired-end7, frag_lib7, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib7/*.fq.gz
paired-end8, frag_lib8, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib8/*.fq.gz
mate-pair_1, jump_lib1, /home/user1/DENOVO/draft_genome/data/jump_lib1/*.fq.gz
mate-pair_2, jump_lib2, /home/user1/DENOVO/draft_genome/data/jump_lib2/*.fq.gz
pacbioreads, pacbio_long, /home/user1/DENOVO/draft_genome/data/pacbio_long/subreads*.fastq
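One thing that can be checked independently of ALLPATHS-LG is whether each file_name pattern in the table above matches any files on disk. The following is a minimal Python sketch, assuming the patterns behave like ordinary shell globs; count_matches is a hypothetical helper name, and the check says nothing about how ALLPATHS-LG itself classifies a library.

```python
import csv
import glob


def count_matches(lines):
    """Map each group_name to the number of files its file_name glob matches.

    `lines` is an open in_groups.csv file or any iterable of its lines.
    """
    # skipinitialspace drops the blank that follows each comma in the file.
    reader = csv.DictReader(lines, skipinitialspace=True)
    counts = {}
    for row in reader:
        pattern = row["file_name"].strip()
        counts[row["group_name"].strip()] = len(glob.glob(pattern))
    return counts
```

Passing an open handle on in_groups.csv to count_matches and looking for groups with a count of 0 would show whether a pattern such as subreads*.fastq fails to match the converted files.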
Can anyone help me find the mistake? Additionally, I would like to know the approximate running time needed to assemble a human-sized genome with ALLPATHS-LG. Any suggestions are welcome.