Question

ALLPATHS-LG denovo assembly input file pre-processing and total execution time

6.3 years ago

bio_d ▴ 20

Hi,

I am attempting a de novo assembly of a reptile genome (size comparable to human). I have one Illumina paired-end library (~200 bp) and two Illumina mate-pair libraries (5.2 kb and 10 kb). In addition, I have a PacBio library (<10x coverage).

I am a little confused. ALLPATHS-LG ships several scripts, executables, and modules, such as CleanCorrectedReads, CorrectLongReads, ErrorCorrectJump, ErrorCorrectReads.pl, and EvaluateCorrectedPairs. Should one provide all the libraries as is (i.e. as obtained from the sequencing center) and let ALLPATHS-LG use its own error-correction machinery to correct them?

Nonetheless, I trimmed the Illumina libraries for quality and adapters, and corrected the PacBio sequences with LoRDEC (using all the short-read libraries).

However, when I use these corrected and trimmed sequences in the "PrepareAllPathsInputs.pl" step of ALLPATHS-LG, only the short reads are added to the database (although the output log table has a note that the PacBio library will be added to the database). The SUB_DIR (which is "data" in my case) is populated with the following files:

frag_reads_orig.fastb, frag_reads_orig.qualb, frag_reads_orig.pairs

jump_reads_orig.fastb, jump_reads_orig.qualb, jump_reads_orig.pairs

However, the corresponding files for the PacBio library (long_jump_reads_orig.fastb, long_jump_reads_orig.qualb, long_jump_reads_orig.pairs) are absent. At first, I supplied the LoRDEC-corrected PacBio sequences as FASTA files, and ALLPATHS-LG complained that there were no quality scores (rightly so, because FASTA carries no quality information). I then used a fasta_to_fastq Perl script to convert the corrected PacBio reads to FASTQ format and reran the PrepareAllPathsInputs.pl step, but to my surprise the output log for that step shows:

==================== WARNINGS ====================

!!!! No 'long_jump' cached read groups found. Long jumping reads (typically 40 kb, < 1x coverage) are useful only for scaffolding of vertebrate size genomes, and are not required for an assembly.

!!!! No 'long' cached read groups found. Long reads (typically, unpaired, 1 kb, 50x coverage) are useful only for gap patching of relatively small genomes, and are not required for an assembly.

==================================================
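For reference, the FASTA-to-FASTQ conversion mentioned above can be sketched in a few lines of Python. This is only an illustration, not the Perl script actually used: the function names and the placeholder quality character "I" (Phred 40 in Phred+33 encoding) are assumptions, and the resulting qualities are made up rather than measured.

```python
def fasta_to_fastq(fasta_lines, placeholder_qual="I"):
    """Yield FASTQ records (as strings) from an iterable of FASTA lines.

    FASTA has no quality information, so every base receives the same
    placeholder quality character. The default "I" is an arbitrary choice.
    """
    name, seq_parts = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                seq = "".join(seq_parts)
                yield f"@{name}\n{seq}\n+\n{placeholder_qual * len(seq)}\n"
            name, seq_parts = line[1:], []
        else:
            seq_parts.append(line)  # multi-line sequences are joined
    if name is not None:
        seq = "".join(seq_parts)
        yield f"@{name}\n{seq}\n+\n{placeholder_qual * len(seq)}\n"


def convert_file(fasta_path, fastq_path):
    """Convert one FASTA file to FASTQ; both paths are supplied by the caller."""
    with open(fasta_path) as src, open(fastq_path, "w") as dst:
        for record in fasta_to_fastq(src):
            dst.write(record)
```

Calling `convert_file("corrected.fasta", "corrected.fastq")` (hypothetical file names) would write one four-line FASTQ record per FASTA entry.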

My in_groups.csv file looks like this:

group_name, library_name, file_name

paired-end1, frag_lib1, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib1/*.fq.gz

paired-end7, frag_lib7, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib7/*.fq.gz

paired-end8, frag_lib8, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib8/*.fq.gz

mate-pair_1, jump_lib1, /home/user1/DENOVO/draft_genome/data/jump_lib1/*.fq.gz

mate-pair_2, jump_lib2, /home/user1/DENOVO/draft_genome/data/jump_lib2/*.fq.gz

pacbioreads, pacbio_long, /home/user1/DENOVO/draft_genome/data/pacbio_long/subreads*.fastq

Can anyone help me find the mistake? Additionally, I would like to know the approximate running time needed to assemble a human-sized genome with ALLPATHS-LG. Any suggestions are welcome.

sequence correction ALLPATHS_LG Hybrid assembly • 2.0k views

updated 5.9 years ago by h.mon 35k • written 6.3 years ago by bio_d ▴ 20

score 0 · Answer 1 · 2018-05-20

You made no mistake, as far as I can see: you just don't have the mentioned library types. Warnings are not errors, and you should be able to proceed with the assembly step.
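If you want to rule out a simple path problem anyway, a quick check is to confirm that every file_name pattern in in_groups.csv matches at least one file. The snippet below is a minimal sketch, assuming the three-column header shown in the question; `check_in_groups` is a hypothetical helper, and Python's `glob` may not expand patterns exactly the way PrepareAllPathsInputs.pl does.

```python
import csv
import glob


def check_in_groups(csv_path):
    """Return {group_name: [matching files]} for each row of in_groups.csv."""
    matches = {}
    with open(csv_path, newline="") as handle:
        # skipinitialspace tolerates "a, b, c" style rows; blank rows are skipped.
        reader = csv.DictReader(handle, skipinitialspace=True)
        for row in reader:
            pattern = row["file_name"].strip()
            matches[row["group_name"].strip()] = sorted(glob.glob(pattern))
    return matches


def report(csv_path):
    """Print one line per group, flagging patterns that match no files."""
    for group, files in check_in_groups(csv_path).items():
        status = f"{len(files)} file(s)" if files else "NO FILES MATCHED"
        print(f"{group}: {status}")
```

Running `report("in_groups.csv")` prints one line per group, so a pattern that matches nothing stands out immediately.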

(Very) approximately, I would guess at least 10 days of run time, but: 1) you didn't tell us your computing resources; 2) some genome characteristics (such as repeat content) can cause 10x or 100x differences in assembly time between genomes of "similar" size.