Question: ALLPATHS-LG de novo assembly: input file pre-processing and total execution time
bio_d wrote (14 months ago):

Hi,

I am attempting a de novo assembly of a reptile genome (comparable in size to the human genome). I have an Illumina paired-end library (~200b), two Illumina mate-pair libraries (5.2 kb and 10 kb), and a PacBio library (<10X coverage).

I am a little confused. ALLPATHS-LG ships with scripts, executables, and modules such as CleanCorrectedReads, CorrectLongReads, ErrorCorrectJump, ErrorCorrectReads.pl, and EvaluateCorrectedPairs. Should one therefore provide all libraries as is (i.e. as received from the sequencing center) and let ALLPATHS-LG apply its own error-correction machinery to all of them?

Nonetheless, I trimmed the Illumina libraries for quality and adapters, and corrected the PacBio sequences with LoRDEC (using all of the short-read libraries).

However, when I pass these corrected and trimmed sequences to the PrepareAllPathsInputs.pl step of ALLPATHS-LG, only the short reads are added to the database (although the table in the output log notes that the PacBio library will be added). The SUB_DIR (which is "data" in my case) is populated with the following files:

frag_reads_orig.fastb, frag_reads_orig.qualb, frag_reads_orig.pairs

jump_reads_orig.fastb, jump_reads_orig.qualb, jump_reads_orig.pairs
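
A quick way to confirm which read sets were actually written is to check the directory for each expected file. The sketch below is only an illustration: the function name missing_read_files is hypothetical, and the prefixes and extensions are simply the ones named in this post, not a complete list of what ALLPATHS-LG can produce.

```python
from pathlib import Path

# Prefixes and extensions are taken from this post (assumption: other
# ALLPATHS-LG read sets may use different names).
PREFIXES = ("frag_reads_orig", "jump_reads_orig", "long_jump_reads_orig")
EXTENSIONS = (".fastb", ".qualb", ".pairs")


def missing_read_files(data_dir):
    """Return {prefix: [missing extensions]} for every incomplete read set."""
    data_dir = Path(data_dir)
    missing = {}
    for prefix in PREFIXES:
        # Collect the extensions for which no regular file exists.
        absent = [
            ext for ext in EXTENSIONS
            if not (data_dir / f"{prefix}{ext}").is_file()
        ]
        if absent:
            missing[prefix] = absent
    return missing
```

Calling missing_read_files("data") returns an empty dictionary when every listed file is present, and otherwise maps each incomplete prefix to the extensions that were not found.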

However, the corresponding files for the PacBio library (long_jump_reads_orig.fastb, long_jump_reads_orig.qualb, long_jump_reads_orig.pairs) are absent. At first I supplied the LoRDEC-corrected PacBio sequences as FASTA files, and ALLPATHS-LG complained that there were no quality scores (rightly so, because FASTA carries no quality information). I then used a fasta_to_fastq Perl script to convert the corrected PacBio reads to FASTQ and reran the PrepareAllPathsInputs.pl step. To my surprise, the output log for that step shows:

==================== WARNINGS ====================

!!!! No 'long_jump' cached read groups found. Long jumping reads (typically 40 kb, < 1x coverage) are useful only for scaffolding of vertebrate size genomes, and are not required for an assembly.

!!!! No 'long' cached read groups found. Long reads (typically, unpaired, 1 kb, 50x coverage) are useful only for gap patching of relatively small genomes, and are not required for an assembly.

==================================================
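
For reference, the FASTA-to-FASTQ conversion mentioned above can be sketched as follows. This is not the Perl script that was actually used; it is a minimal Python illustration that assumes every base is given the same placeholder quality character ("I", which is Phred 40 in the Phred+33 encoding). The function name and the choice of placeholder are assumptions, and the resulting qualities say nothing about the real accuracy of the reads.

```python
def _fastq_record(header, sequence, quality_char):
    """Build the four FASTQ lines for one record."""
    return [f"@{header}", sequence, "+", quality_char * len(sequence)]


def fasta_to_fastq(fasta_lines, quality_char="I"):
    """Yield FASTQ lines for each FASTA record in fasta_lines.

    Every base receives the same placeholder quality character, so the
    output carries no real quality information.
    """
    header, parts = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield from _fastq_record(header, "".join(parts), quality_char)
            header, parts = line[1:], []
        else:
            parts.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield from _fastq_record(header, "".join(parts), quality_char)
```

The generator accepts any iterable of lines, so an open file handle can be passed directly and the output written line by line without holding the whole read set in memory.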

My in_groups.csv file looks like this:

group_name, library_name, file_name
paired-end1, frag_lib1, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib1/*.fq.gz
paired-end7, frag_lib7, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib7/*.fq.gz
paired-end8, frag_lib8, /home/user1/DENOVO/draft_genome/data/frag_lib/frag_lib8/*.fq.gz
mate-pair_1, jump_lib1, /home/user1/DENOVO/draft_genome/data/jump_lib1/*.fq.gz
mate-pair_2, jump_lib2, /home/user1/DENOVO/draft_genome/data/jump_lib2/*.fq.gz
pacbioreads, pacbio_long, /home/user1/DENOVO/draft_genome/data/pacbio_long/subreads*.fastq
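
One simple sanity check on a file like this is to expand each file_name pattern and count the files it matches, since a pattern that matches nothing is easy to overlook. The sketch below assumes only the three column names shown above; the function name check_groups is hypothetical. It does not reproduce anything PrepareAllPathsInputs.pl does internally and says nothing about how ALLPATHS-LG classifies each library.

```python
import csv
import glob


def check_groups(csv_path):
    """Return (group_name, library_name, pattern, n_matches) for each row."""
    results = []
    with open(csv_path, newline="") as handle:
        # skipinitialspace drops the blank that follows each comma;
        # blank lines between rows are skipped by DictReader.
        reader = csv.DictReader(handle, skipinitialspace=True)
        for row in reader:
            group = (row.get("group_name") or "").strip()
            library = (row.get("library_name") or "").strip()
            pattern = (row.get("file_name") or "").strip()
            # Count how many files on disk match the wildcard pattern.
            n_matches = len(glob.glob(pattern)) if pattern else 0
            results.append((group, library, pattern, n_matches))
    return results
```

Any row reported with zero matches points at a path or wildcard that should be checked before rerunning the preparation step.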

Can anyone help me find the mistake? Additionally, I would like to know the approximate running time needed to assemble a human-sized genome with ALLPATHS-LG. Any suggestions are welcome.

h.mon (Brazil) wrote (10 months ago):

You made no mistake, as far as I can see: you simply do not have the library types mentioned in the warnings. Warnings are not errors, and you should be able to proceed with the assembly step.

(Very) approximately, I would guess at least 10 days of running time, but: 1) you did not tell us your computing resources; 2) some genome characteristics (such as repeat content) can cause 10x or 100x differences in assembly time for genomes of "similar" size.
