Hi guys,
I am trying to use the Dedupe.sh tool from BBTools, developed by Brian Bushnell, to find overlaps in my de novo assembled contigs. The input is a large FASTA file containing all contigs generated by different assemblers (MIRA, ABySS, SOAPdenovo, SPAdes), with the sequence headers replaced by a sequential counter (>1, >2, and so on). I use the overlap graph generated by Dedupe.sh to merge the overlapping contigs with a Perl script.
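For reference, the header renumbering described above can be sketched roughly as follows. This is only an illustration in Python, not the script actually used for this dataset; the function name `renumber_fasta` and the example headers are hypothetical.

```python
import io


def renumber_fasta(handles, out):
    """Copy FASTA records from several inputs to `out`,
    replacing every header line with a running counter (>1, >2, ...)."""
    counter = 0
    for handle in handles:
        for line in handle:
            if line.startswith(">"):
                counter += 1
                out.write(f">{counter}\n")
            elif line.strip():
                # Sequence line: keep as-is, but make sure it ends with a newline
                out.write(line if line.endswith("\n") else line + "\n")
    return counter


# Toy inputs standing in for the per-assembler contig files (made-up records)
mira = io.StringIO(">mira_c1 cov=12\nACGT\nGGCC\n")
spades = io.StringIO(">NODE_1_length_8\nTTTTAAAA")
merged = io.StringIO()

n = renumber_fasta([mira, spades], merged)
print(n)
print(merged.getvalue(), end="")
```

With real files, the `io.StringIO` objects would be replaced by open file handles, one per assembler output.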
The problem is that when I run Dedupe.sh on the merged assemblies file, I get the following errors at runtime. I made a few edits to the program output pasted below for clarity:
- Only showing the first 20-something bases of the sequences
- Separated the errors into two blocks, as the first block repeated itself for quite a while
Here is the output of the Dedupe.sh run:
Command:
~/software/bbmap/dedupe.sh in=../../a6_best_assembly_v1.fasta out=a6_best_assembly_v1_DR.fasta outd=a6_best_assembly_v1_duplicates.fasta pattern=a6_best_assembly_v1_cluster% dot=a6_best_assembly_v1_overlapgraph.dot arc am ac fo c mcs=2 pc=t minidentity=95 mo=100 pto=t ngn=f sq=f ple=t overwrite=t
Output:
java -Djava.library.path=/home/pgbsilva/software/bbmap/jni/ -ea -Xmx68519m -Xms68519m -cp /home/pgbsilva/software/bbmap/current/ jgi.Dedupe in=../../a6_best_assembly_v1.fasta out=a6_best_assembly_v1_DR.fasta outd=a6_best_assembly_v1_duplicates.fasta
pattern=a6_best_assembly_v1_cluster% dot=a6_best_assembly_v1_overlapgraph.dot arc am ac fo c mcs=2 pc=t minidentity=95 mo=100 pto=t ngn=f sq=f ple=t overwrite=t
Executing jgi.Dedupe [in=../../a6_best_assembly_v1.fasta, out=a6_best_assembly_v1_DR.fasta, outd=a6_best_assembly_v1_duplicates.fasta,
pattern=a6_best_assembly_v1_cluster%, dot=a6_best_assembly_v1_overlapgraph.dot, arc, am, ac, fo, c, mcs=2, pc=t, minidentity=95, mo=100, pto=t, ngn=f, sq=f, ple=t, overwrite=t]
Initial:
Memory: max=68853m, free=68135m, used=718m
Found 0 duplicates.
Finished exact matches. Time: 0.197 seconds.
Memory: max=68853m, free=58795m, used=10058m
Found 0 contained sequences.
Finished containment. Time: 0.168 seconds.
Memory: max=68853m, free=63082m, used=5771m
Removed 0 invalid entries.
Finished invalid removal. Time: 0.002 seconds.
Memory: max=68853m, free=63082m, used=5771m
First error block
Exception in thread "Thread-72" java.lang.AssertionError:
type=FORWARD, len=567, subs=299, edits=0 (175032, length=12932, start1=12365, stop1=12931) (229892, length=1324, start2=0, stop2=566)
>1
ATTCCTTGAGTTTTTCTTCCAACCATTTTACTAACATTTTAATTTCTGCTCTCCTATTTTCAGTTATTGAGATTTTTTGCCTGGTGTTTCTGTTTATGGCCTTCTAATTTTGTTCCATGAATGCAATAAGTTCTCCT **(sequence continues)**
>2
TTTCTTCACAGAATTGGAAAAAACTACTTTAAAGTTCATATGGAACCAAAAAAGAGCCCGCATTGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACCTGACTTCAAACTATACTACAAGG **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-63" java.lang.AssertionError:
type=FORWARD, len=714, subs=491, edits=0 (175050, length=11792, start1=11078, stop1=11791) (210803, length=1497, start2=0, stop2=713)
>1
ACCAGCATATACAGAGACCAAATCAATACAAATAGCAAAGTTAGTATAACATGCTAGTTTTGAAATGATTAATATGTAATATGTTTTTGGAAATTATTAGTTGATTTATTCCTTTACTCACAAATATTTATTCAGT **(sequence continues)**
>2
CCGTTTTAGGCGCAACAGACCAACCAGACCAGAATGGATTCATCCATACTAAGTGCCATGTAATCAAACTGACTCATACGGACCAGTTTTCCAAAAAACCTGAAGTAGAATGAAAGGAATATAAAGGAAGATACAG **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-53" java.lang.AssertionError:
type=FORWARD, len=263, subs=28, edits=0 (221070, length=1188, start1=0, stop1=262) (223564, length=1064, start2=801, stop2=1063)
>1
AAGTGGAGCTGGCTTGGAAAGAATAGGGAAACGGGTGCAACTCCCGTGCGGTTACGCCGCTGTAACAAGTGACGAAGGCTTTATCTATAGCCACTGTCGCACCTGCCTCTTATACACAGCTGACGCTGCCGACGA **(sequence continues)**
>2
AGAAGACCTGCTTTTTCATGCTCATCACTCCCATGTAAATCGGGAGACTGTCTCGCTAAAGACAGGATGCTGTCTTTTATACACAGCTGACGCTGCCGACGACGCCTCTAGTTTATTCGTCTGTTGTCGCTCACA **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-69" java.lang.AssertionError:
type=FORWARDRC, len=912, subs=77, edits=0 (188, length=1866, start1=954, stop1=1865) (211768, length=1176, start2=1175, stop2=264)
>1
TGTCGCTACCGCGATAGGGCAAAAAGTCCTAAAGTTTAGTAAGTGTTTGCTTGGAACACTTTTTCATGAGCCCTTTAATAAGGGGCAGTGGAAGAAATTCATGTAGAGCTCCTTTTTTTTGCATCAATAGGCAA **(sequence continues)**
>2
ATAAAGCGAAAGAGAGCGCTTTTTTTTCAGCGTCTAAATTCTTCGTATGATTTCCCTCACATAGTTAGCGAAATCCATTTCCAATGCACTGCATTTGGAAATTTTTTGCCTATTGATGCAAAAAAAAGGAGCTC **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapForwardRC(Dedupe.java:5229)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4851)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-59" java.lang.AssertionError:
type=FORWARD, len=623, subs=363, edits=0 (174723, length=12656, start1=12033, stop1=12655) (20470, length=9394, start2=0, stop2=622)
>1
GTTTGCGAAACTAAAGACAAAAGAAATGCCATAAAAATATCTTCTAGATGACAAAGTTGTGCCTTTTGGAGTTGCATTTTAACACATCGAAACCACTACACACATACACGGGAACTGCACAATTGGGTAAATA **(sequence continues)**
>2
ATGTGCAAGTTTGTTACATGGGTATACATGTGCTATGTTGGTTTGTTGCACCTATTAACTCATCACTTACATTGGGTATTTCTCCTAATGCTATCCTTCCTCCAGCCCCCCACCCCATGACAGGCCCCAGTGT **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-57" java.lang.AssertionError:
type=FORWARD, len=210, subs=15, edits=0 (66196, length=2881, start1=2671, stop1=2880) (67503, length=1435, start2=0, stop2=209)
>1
GCTCTTTGGAATGCCAGACGCAGTGGCATGTACCTGTAGACCCACCTACAAGGTGGGCTGTTGTGGGCTGTAGTGTGCTGTGATTTTGCCTGTGATACCCACTGCCCTCCAGCCTAGGCAACATAGTGAGA **(sequence continues)**
>2
GCCACAGTTTCTTAATCCAGTCTATCACTGATGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTGAATAGTGCCGCAATAAACATACGTGTGCATGTGTCTTTATAGCAGCATGATTTATAATCCTTT **(sequence continues)**
at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
This set of errors repeats a few more times. The program then continues running and outputs this:
Found 241 overlaps.
Finished finding overlaps. Time: 0.109 seconds.
Memory: max=68853m, free=62391m, used=6462m
Overlaps: 540, length: 1025664
Counted overlaps. Time: 0.003 seconds.
Memory: max=68853m, free=62391m, used=6462m
Clusters: 8104 (114 of at least size 2)
Size Range Clusters Reads Bases
1 7990 7990 24600288
2 55 110 562883
3-4 42 145 665875
5-8 15 91 432193
9-16 1 15 48450
17-32 1 17 128562
Largest: 17
Finished making clusters. Time: 0.012 seconds.
Memory: max=68853m, free=62391m, used=6462m
Removed 0 invalid entries.
Finished invalid removal. Time: 0.001 seconds.
Memory: max=68853m, free=62391m, used=6462m
Second set of errors:
Exception in thread "Thread-79" java.lang.AssertionError
at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-91" java.lang.AssertionError
at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-90" Exception in thread "Thread-78" java.lang.AssertionError
at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
java.lang.AssertionError
at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-98" java.lang.AssertionError
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2864)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-81" java.lang.AssertionError
at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
The second block of errors also repeats a few more times.
Found 6 multijoins (4178 bases).
Experienced 0 multijoin removal failures.
Flipped 118 reads and 162 overlaps.
Found 0 clusters (0 overlaps) with contradictory orientation cycles.
Found 1 clusters (2 overlaps) with remaining cycles.
After processing clusters:
Clusters: 8019 (29 of at least size 2)
Size Range Clusters Reads Bases
1 7990 7990 24600288
2 29 79 377770
Largest: 7
Finished processing. Time: 0.016 seconds.
Memory: max=68853m, free=61728m, used=7125m
Input: 8368 reads 26438251 bases.
Duplicates: 0 reads (0.00%) 0 bases (0.00%) 0 collisions.
Containments: 0 reads (0.00%) 0 bases (0.00%) 121936 collisions.
Overlaps: 241 reads (2.88%) 512832 bases (1.94%) 1492 collisions.
Result: 8368 reads (100.00%) 26438251 bases (100.00%)
Printed output. Time: 0.161 seconds.
Memory: max=68853m, free=61673m, used=7180m
Time: 0.685 seconds.
Reads Processed: 8368 12.22k reads/sec
Bases Processed: 26438k 38.60m bases/sec
I am not experienced in Java, so I have no idea what the errors mean, but it seems that in some of the threads something goes wrong during overlap detection between two contigs. The second set of errors occurs during the canonicize step of Dedupe.sh (the stack traces point to ClusterThread.canonicize). I don't know whether it is caused by the first set or is an unrelated incident. Also, every time I run the program on this dataset, it reports a different number of clusters.
Have any of you encountered these errors while running Dedupe.sh? Any help would be fantastic!
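One way to get a feel for the first block of errors is to tabulate the numbers printed in the assertion messages. The sketch below is a rough Python illustration, not part of BBTools. It assumes that `len` is the overlap length and `subs` is the number of substitutions within that overlap, which is only a reading of the message text and has not been checked against the Dedupe source. The function name `overlap_identities` is hypothetical.

```python
import re

# Matches the summary line printed with each AssertionError in the first block
PATTERN = re.compile(
    r"type=(?P<type>\w+), len=(?P<len>\d+), subs=(?P<subs>\d+), edits=(?P<edits>\d+)"
)


def overlap_identities(log_text):
    """Return (type, len, subs, percent identity) for each reported overlap.

    Assumption: identity = (len - subs) / len, i.e. `subs` counts
    mismatches inside an overlap of `len` bases.
    """
    rows = []
    for m in PATTERN.finditer(log_text):
        length = int(m["len"])
        subs = int(m["subs"])
        identity = 100.0 * (length - subs) / length
        rows.append((m["type"], length, subs, round(identity, 1)))
    return rows


# Three of the messages from the output above
sample = """\
type=FORWARD, len=567, subs=299, edits=0 (175032, length=12932, start1=12365, stop1=12931) (229892, length=1324, start2=0, stop2=566)
type=FORWARD, len=263, subs=28, edits=0 (221070, length=1188, start1=0, stop1=262) (223564, length=1064, start2=801, stop2=1063)
type=FORWARDRC, len=912, subs=77, edits=0 (188, length=1866, start1=954, stop1=1865) (211768, length=1176, start2=1175, stop2=264)
"""

rows = overlap_identities(sample)
for row in rows:
    print(row)
```

Under that assumption, these three overlaps come out at roughly 47.3%, 89.4% and 91.6% identity, all below the minidentity=95 given on the command line, which might be related to why the assertion fires.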
Can you post the exact command you are using?
Looking at the thread numbers, are you spreading this job across a cluster? That is not a good idea. Can you explicitly run it constrained to a single node (by specifying the correct number of threads with the
threads=n
option, and exclusive use of the node if you are using a job scheduler)?
I am using only one node of a cluster we have here at Virginia Tech, as far as I am aware. Based on my limited knowledge of cluster architecture, the one I'm using has 134 nodes, but I am logged into one of the 8 interactive nodes. Each interactive node has 24 cores, and the listed CPU is 2 x E5-2680v3 2.5GHz (Haswell). I don't think I am using a job scheduler to run this command; I am just running it normally from the command line.
Here is the exact command I used to run it (I'll also edit the post with the command):
I'll run the tool again with the number of threads set to 1. Is that what you are suggesting?
Looks like you are running this job on the login node, since I don't see any commands related to the job scheduler. That is not a good practice.
Do you know what job scheduler your cluster uses?
There should be a maximum of 48 threads possible per physical node with multi-threading (if you have 24 cores per node). I am not sure where the thread numbers > 48 in your errors are coming from. Perhaps the head node has CPUs with more than 48 cores.
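A quick way to check how many logical CPUs the node you are logged into actually reports is sketched below. This is a small Python illustration; the helper name `logical_cpus` is hypothetical, and the affinity call is only available on some platforms such as Linux, so it is guarded.

```python
import os


def logical_cpus():
    """Return (logical CPUs reported by the OS, CPUs this process may run on)."""
    total = os.cpu_count() or 1  # cpu_count() can return None
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU restrictions placed on the current process
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return total, usable


total, usable = logical_cpus()
print(f"logical CPUs: {total}, usable by this process: {usable}")
```

If the first number is larger than 48, the node has more hardware threads than the 24-core figure suggests.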
I would say add
threads=24
and see what happens (in addition to the things @Brian has asked below).
Running it with
threads=24
produces the same errors as the original run, and with
threads=1
it runs until 3 of the errors from the first block appear and then it freezes. Even with the number of threads set to one, errors like "Exception in thread "Thread-64" java.lang.AssertionError:" still appear.
Here is some info on the cluster I am using, if it helps: www.arc.vt.edu/computing/newriver/
I know the basics of how to use servers, but since I was given access to the interactive node and have never had to use the job scheduler, I assumed I didn't need it.