Problem with "split_and_run_sparc.sh" from DBG2OLC pipeline
5.8 years ago

Hi everybody!

I'm having a problem in the consensus stage of the DBG2OLC pipeline. I'm using the script "split_and_run_sparc.sh" to obtain the "final_assembly.fasta" file from my backbone file (backbone_raw.fasta) and my reads (ctg_pb.fasta). I ran the script using the following command:

sh ./split_and_run_sparc.sh backbone_raw.fasta DBG2OLC_Consensus_info.txt ctg_pb.fasta /tmp/consensus_dir 2 >cns_log.txt

While the script was running, an error message appeared:

Traceback (most recent call last):
  File "./split_reads_by_backbone.py", line 131, in <module>
  File "./split_reads_by_backbone.py", line 122, in main
IOError: [Errno 24] Too many open files: '/tmp/consensus_dir/backbone-1627.reads.fasta'

After the analysis, I observed some inconsistencies between the "backbone_raw.fasta" file and the "final_assembly.fasta" file:

---------------- Information for assembly 'backbone_raw.fasta' ----------------

Number of contigs                         1906
Number of contigs in scaffolds               0
Number of contigs not in scaffolds        1906
Total size of contigs                252974640
Longest contig                         2502428
Shortest contig                           4957
Number of contigs > 1K nt                 1906  100.0%
Number of contigs > 10K nt                1872   98.2%
Number of contigs > 100K nt                512   26.9%
Number of contigs > 1M nt                   31    1.6%
Number of contigs > 10M nt                   0    0.0%
Mean contig size                        132725
Median contig size                       35400
N50 contig length                       449759
L50 contig count                           147


---------------- Information for assembly 'final_assembly.fasta' ----------------

Number of contigs                         1020
Number of contigs in scaffolds               0
Number of contigs not in scaffolds        1020
Total size of contigs                223116219
Longest contig                         2502428
Shortest contig                             83
Number of contigs > 1K nt                 1018   99.8%
Number of contigs > 10K nt                1009   98.9%
Number of contigs > 100K nt                470   46.1%
Number of contigs > 1M nt                   31    3.0%
Number of contigs > 10M nt                   0    0.0%
Mean contig size                        218741
Median contig size                       82745
N50 contig length                       548456
L50 contig count                           117


The main inconsistencies between the two files are:

• The number of contigs almost halved (from 1906 to 1020)
• The total size of the assembled genome is smaller (there are 886 fewer contigs)
• Some contigs became shorter (the shortest contig dropped from 4957 nt to 83 nt)
• The N50, mean and median contig sizes increased (as a by-product of losing contigs)

Does anyone know whether the inconsistencies between the two files are caused by the error message that appeared while the script was running? Or is this the normal output one should expect from the consensus stage of the pipeline?
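
One way to check whether the split step stopped early would be to compare the number of backbones with the number of per-backbone read files that were actually written. Below is a minimal Python sketch. The file-name pattern `backbone-*.reads.fasta` is taken from the error message above; the function names are hypothetical, and the idea that a shortfall in read files explains the missing contigs is an assumption, not something confirmed by the DBG2OLC documentation.

```python
import glob
import os


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_split_read_files(split_dir, pattern="backbone-*.reads.fasta"):
    """Count per-backbone read files in the consensus directory.

    The default pattern is inferred from the path in the error message
    and is an assumption about how the split script names its output.
    """
    return len(glob.glob(os.path.join(split_dir, pattern)))


def summarize(backbone_fasta, final_fasta, split_dir):
    """Collect the three counts needed to spot a truncated split step."""
    return {
        "backbones": count_fasta_records(backbone_fasta),
        "read_files": count_split_read_files(split_dir),
        "final_contigs": count_fasta_records(final_fasta),
    }

# Example call (paths as used in this post):
# print(summarize("backbone_raw.fasta", "final_assembly.fasta", "/tmp/consensus_dir"))
```

If `read_files` turns out to be well below `backbones`, the "Too many open files" error would be a likely suspect for the missing contigs.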

P.S.: I could not run the command "ulimit -n unlimited" before running the script, since I don't have root privileges on the cluster I'm working on. I'm not sure whether this explains the inconsistencies or the error message.
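
For reference, the open-file limit can at least be inspected without root. Below is a minimal Python sketch, assuming a Unix-like system where the standard `resource` module is available. The function names are hypothetical. An unprivileged process can normally raise its soft limit up to the hard limit but not beyond it, so whether this is enough for ~1900 backbones depends on the hard limit set on the cluster.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit_to_hard():
    """Try to raise the soft limit to the hard limit (no root needed).

    Returns the limits in effect afterwards. If the system refuses the
    change, the current limits are returned unchanged.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the existing limits if the change is rejected
    return resource.getrlimit(resource.RLIMIT_NOFILE)

# Example:
# print(open_file_limits())
# print(raise_soft_limit_to_hard())
```

The new limit only applies to the current process and its children, so it would have to be set in whatever process launches the split script. In the shell, `ulimit -Hn` prints the hard limit and `ulimit -n <number>` sets the soft limit for that session.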

genome Assembly correction hybrid • 2.2k views
5.8 years ago
colindaven ★ 3.8k

I had a problem with this stage too. I never got a final assembly out but was stuck at the "backbone_raw.fa" stage.

I did have root access and tried repeatedly to set the ulimit, but it didn't work well, and there are only so many times you can restart servers in a cluster without starting to annoy people.

In the end I got a reasonable final assembly out using Racon (https://github.com/isovic/racon).


I'll try it out.

Thank you very much!


I am also having issues with the consensus stage of DBG2OLC, but in my case the generated "final_assembly.fasta" is empty, even though there is no error message.

So I would like to try your suggestion and run Racon on the "backbone_raw.fasta" assembly from DBG2OLC. However, I don't know which file to use as the "overlap/alignment" input that Racon requires ("Racon takes as input only three files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format"). The DBG2OLC manual is not very clear, and I'm not sure whether such a file is actually generated during the assembly. Do you remember which file you used in your case, or whether you had to generate an overlap/alignment file with different software?
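
For context, such an overlap file is commonly produced by a separate read mapper such as minimap2 rather than by the assembler itself. Below is a minimal Python sketch of how the two steps might be chained. This is an assumption about a typical workflow, not the procedure used in the answer above: `pacbio_reads.fasta`, `overlaps.paf` and `polished.fasta` are hypothetical file names, the function names are made up for illustration, and the preset `-x map-pb` assumes PacBio reads.

```python
import subprocess


def build_polish_commands(contigs, reads, overlaps="overlaps.paf",
                          polished="polished.fasta", threads=4):
    """Build (command, output_file) pairs for one round of polishing.

    Step 1 maps the reads to the contigs with minimap2 (PAF on stdout).
    Step 2 passes reads, overlaps and contigs to racon (FASTA on stdout).
    """
    minimap2_cmd = ["minimap2", "-x", "map-pb", "-t", str(threads),
                    contigs, reads]
    racon_cmd = ["racon", "-t", str(threads), reads, overlaps, contigs]
    return [(minimap2_cmd, overlaps), (racon_cmd, polished)]


def run_polish(contigs, reads, **kwargs):
    """Run both steps, redirecting each command's stdout to its file.

    Requires minimap2 and racon to be installed and on the PATH.
    """
    for cmd, out_path in build_polish_commands(contigs, reads, **kwargs):
        with open(out_path, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)

# Example (hypothetical reads file):
# run_polish("backbone_raw.fasta", "pacbio_reads.fasta", threads=8)
```

Building the commands separately from running them makes it easy to print and inspect them before anything is executed on the cluster.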