Question

BBMap Dedupe.sh bug

2

7.3 years ago

pedroivo000 ▴ 110

Hi guys,

I am trying to use the Dedupe.sh tool from BBTools, developed by Brian Bushnell, to find overlaps in my de novo assembled contigs. The input file is a large FASTA file containing all contigs generated by different assemblers (MIRA, ABySS, SOAPdenovo, SPAdes), with the sequence headers modified to contain just a sequential counter like >1, >2, and so on. I am using the overlap graph generated by Dedupe.sh to merge the overlapping contigs with a Perl script.

The problem is that when I run Dedupe.sh on the merged assemblies file, I get the following errors at runtime. I made a few edits to the program output pasted below for clarity:

Only showing first 20-something bases of sequences
Separated the errors into two blocks, as the first block of errors repeated itself for quite a while

Here is the output of the Dedupe.sh run:

Command:

~/software/bbmap/dedupe.sh in=../../a6_best_assembly_v1.fasta out=a6_best_assembly_v1_DR.fasta outd=a6_best_assembly_v1_duplicates.fasta pattern=a6_best_assembly_v1_cluster% dot=a6_best_assembly_v1_overlapgraph.dot arc am ac fo c mcs=2 pc=t minidentity=95 mo=100 pto=t ngn=f sq=f ple=t overwrite=t

Output:

java -Djava.library.path=/home/pgbsilva/software/bbmap/jni/ -ea -Xmx68519m -Xms68519m -cp /home/pgbsilva/software/bbmap/current/ jgi.Dedupe in=../../a6_best_assembly_v1.fasta out=a6_best_assembly_v1_DR.fasta outd=a6_best_assembly_v1_duplicates.fasta pattern=a6_best_assembly_v1_cluster% dot=a6_best_assembly_v1_overlapgraph.dot arc am ac fo c mcs=2 pc=t minidentity=95 mo=100 pto=t ngn=f sq=f ple=t overwrite=t
Executing jgi.Dedupe [in=../../a6_best_assembly_v1.fasta, out=a6_best_assembly_v1_DR.fasta, outd=a6_best_assembly_v1_duplicates.fasta, pattern=a6_best_assembly_v1_cluster%, dot=a6_best_assembly_v1_overlapgraph.dot, arc, am, ac, fo, c, mcs=2, pc=t, minidentity=95, mo=100, pto=t, ngn=f, sq=f, ple=t, overwrite=t]

Initial:
Memory: max=68853m, free=68135m, used=718m

Found 0 duplicates.
Finished exact matches.    Time: 0.197 seconds.
Memory: max=68853m, free=58795m, used=10058m

Found 0 contained sequences.
Finished containment.      Time: 0.168 seconds.
Memory: max=68853m, free=63082m, used=5771m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.002 seconds.
Memory: max=68853m, free=63082m, used=5771m

First error block

Exception in thread "Thread-72" java.lang.AssertionError: 
type=FORWARD, len=567, subs=299, edits=0 (175032, length=12932, start1=12365, stop1=12931) (229892, length=1324, start2=0, stop2=566)
>1
ATTCCTTGAGTTTTTCTTCCAACCATTTTACTAACATTTTAATTTCTGCTCTCCTATTTTCAGTTATTGAGATTTTTTGCCTGGTGTTTCTGTTTATGGCCTTCTAATTTTGTTCCATGAATGCAATAAGTTCTCCT **(sequence continues)**
>2
TTTCTTCACAGAATTGGAAAAAACTACTTTAAAGTTCATATGGAACCAAAAAAGAGCCCGCATTGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACCTGACTTCAAACTATACTACAAGG **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-63" java.lang.AssertionError: 
type=FORWARD, len=714, subs=491, edits=0 (175050, length=11792, start1=11078, stop1=11791) (210803, length=1497, start2=0, stop2=713)
>1
ACCAGCATATACAGAGACCAAATCAATACAAATAGCAAAGTTAGTATAACATGCTAGTTTTGAAATGATTAATATGTAATATGTTTTTGGAAATTATTAGTTGATTTATTCCTTTACTCACAAATATTTATTCAGT **(sequence continues)**
>2
CCGTTTTAGGCGCAACAGACCAACCAGACCAGAATGGATTCATCCATACTAAGTGCCATGTAATCAAACTGACTCATACGGACCAGTTTTCCAAAAAACCTGAAGTAGAATGAAAGGAATATAAAGGAAGATACAG **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-53" java.lang.AssertionError: 
type=FORWARD, len=263, subs=28, edits=0 (221070, length=1188, start1=0, stop1=262) (223564, length=1064, start2=801, stop2=1063)
>1
AAGTGGAGCTGGCTTGGAAAGAATAGGGAAACGGGTGCAACTCCCGTGCGGTTACGCCGCTGTAACAAGTGACGAAGGCTTTATCTATAGCCACTGTCGCACCTGCCTCTTATACACAGCTGACGCTGCCGACGA **(sequence continues)**
>2
AGAAGACCTGCTTTTTCATGCTCATCACTCCCATGTAAATCGGGAGACTGTCTCGCTAAAGACAGGATGCTGTCTTTTATACACAGCTGACGCTGCCGACGACGCCTCTAGTTTATTCGTCTGTTGTCGCTCACA **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-69" java.lang.AssertionError: 
type=FORWARDRC, len=912, subs=77, edits=0 (188, length=1866, start1=954, stop1=1865) (211768, length=1176, start2=1175, stop2=264)
>1
TGTCGCTACCGCGATAGGGCAAAAAGTCCTAAAGTTTAGTAAGTGTTTGCTTGGAACACTTTTTCATGAGCCCTTTAATAAGGGGCAGTGGAAGAAATTCATGTAGAGCTCCTTTTTTTTGCATCAATAGGCAA **(sequence continues)**
>2
ATAAAGCGAAAGAGAGCGCTTTTTTTTCAGCGTCTAAATTCTTCGTATGATTTCCCTCACATAGTTAGCGAAATCCATTTCCAATGCACTGCATTTGGAAATTTTTTGCCTATTGATGCAAAAAAAAGGAGCTC **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapForwardRC(Dedupe.java:5229)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4851)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-59" java.lang.AssertionError: 
type=FORWARD, len=623, subs=363, edits=0 (174723, length=12656, start1=12033, stop1=12655) (20470, length=9394, start2=0, stop2=622)
>1
GTTTGCGAAACTAAAGACAAAAGAAATGCCATAAAAATATCTTCTAGATGACAAAGTTGTGCCTTTTGGAGTTGCATTTTAACACATCGAAACCACTACACACATACACGGGAACTGCACAATTGGGTAAATA **(sequence continues)**
>2
ATGTGCAAGTTTGTTACATGGGTATACATGTGCTATGTTGGTTTGTTGCACCTATTAACTCATCACTTACATTGGGTATTTCTCCTAATGCTATCCTTCCTCCAGCCCCCCACCCCATGACAGGCCCCAGTGT **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)
Exception in thread "Thread-57" java.lang.AssertionError: 
type=FORWARD, len=210, subs=15, edits=0 (66196, length=2881, start1=2671, stop1=2880) (67503, length=1435, start2=0, stop2=209)
>1
GCTCTTTGGAATGCCAGACGCAGTGGCATGTACCTGTAGACCCACCTACAAGGTGGGCTGTTGTGGGCTGTAGTGTGCTGTGATTTTGCCTGTGATACCCACTGCCCTCCAGCCTAGGCAACATAGTGAGA **(sequence continues)**
>2
GCCACAGTTTCTTAATCCAGTCTATCACTGATGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTGAATAGTGCCGCAATAAACATACGTGTGCATGTGTCTTTATAGCAGCATGATTTATAATCCTTT **(sequence continues)**

    at jgi.Dedupe$Overlap.<init>(Dedupe.java:3989)
    at jgi.Dedupe$Unit.makeOverlapReverse(Dedupe.java:5295)
    at jgi.Dedupe$Unit.makeOverlap(Dedupe.java:4856)
    at jgi.Dedupe$HashThread.findOverlaps(Dedupe.java:3410)
    at jgi.Dedupe$HashThread.processRead(Dedupe.java:3274)
    at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3152)
    at jgi.Dedupe$HashThread.run(Dedupe.java:3085)

This set of errors repeats a few more times. The program then continues running and outputs this:

Found 241 overlaps.
Finished finding overlaps. Time: 0.109 seconds.
Memory: max=68853m, free=62391m, used=6462m

Overlaps:       540,    length: 1025664
Counted overlaps.          Time: 0.003 seconds.
Memory: max=68853m, free=62391m, used=6462m

Clusters:         8104 (114 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 7990              7990              24600288
2                 55                110               562883
3-4               42                145               665875
5-8               15                91                432193
9-16              1                 15                48450
17-32             1                 17                128562

Largest:          17
Finished making clusters.  Time: 0.012 seconds.
Memory: max=68853m, free=62391m, used=6462m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.001 seconds.
Memory: max=68853m, free=62391m, used=6462m

Second set of errors:

Exception in thread "Thread-79" java.lang.AssertionError
    at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-91" java.lang.AssertionError
    at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-90" Exception in thread "Thread-78" java.lang.AssertionError
    at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
java.lang.AssertionError
    at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-98" java.lang.AssertionError
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2864)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)
Exception in thread "Thread-81" java.lang.AssertionError
    at jgi.Dedupe$Overlap.flip(Dedupe.java:4126)
    at jgi.Dedupe$ClusterThread.canonicize(Dedupe.java:2858)
    at jgi.Dedupe$ClusterThread.canonicizeNeighbors(Dedupe.java:2723)
    at jgi.Dedupe$ClusterThread.canonicizeClusterBreadthFirst(Dedupe.java:2660)
    at jgi.Dedupe$ClusterThread.run(Dedupe.java:2080)

The second block of errors also repeats a few more times.

Found 6 multijoins (4178 bases).
Experienced 0 multijoin removal failures.
Flipped 118 reads and 162 overlaps.
Found 0 clusters (0 overlaps) with contradictory orientation cycles.
Found 1 clusters (2 overlaps) with remaining cycles.

After processing clusters:
Clusters:         8019 (29 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 7990              7990              24600288
2                 29                79                377770

Largest:          7
Finished processing.       Time: 0.016 seconds.
Memory: max=68853m, free=61728m, used=7125m

Input:                      8368 reads      26438251 bases.
Duplicates:                 0 reads (0.00%)     0 bases (0.00%)         0 collisions.
Containments:               0 reads (0.00%)     0 bases (0.00%)     121936 collisions.
Overlaps:                   241 reads (2.88%)   512832 bases (1.94%)        1492 collisions.
Result:                     8368 reads (100.00%)    26438251 bases (100.00%)

Printed output.            Time: 0.161 seconds.
Memory: max=68853m, free=61673m, used=7180m

Time:               0.685 seconds.
Reads Processed:        8368    12.22k reads/sec
Bases Processed:      26438k    38.60m bases/sec

I am not experienced in Java, so I have no idea what the errors mean, but it seems that for some of the threads something went wrong during overlap detection between two contigs, and that is causing the error. For the second set, the error happens during the canonicalization step of Dedupe.sh. I don't know whether it is caused by the first one or is an unrelated incident. Also, every time I run the program with this dataset, it generates a different number of clusters.
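To get a feel for what the assertion lines in the first block report, the `len=` and `subs=` fields can be pulled out of the STDERR text and turned into a percent identity. This is a minimal Python sketch, not part of BBTools: `overlap_identities` is an illustrative name, and it assumes identity means matching bases divided by overlap length (how Dedupe defines it internally is not stated in this thread).

```python
import re

# Matches the detail line printed after each "java.lang.AssertionError:" in the log.
DETAIL = re.compile(r"type=(\w+), len=(\d+), subs=(\d+), edits=(\d+)")

def overlap_identities(stderr_text):
    """Return (type, length, subs, identity_pct) for every assertion detail line."""
    rows = []
    for match in DETAIL.finditer(stderr_text):
        kind = match.group(1)
        length, subs = int(match.group(2)), int(match.group(3))
        # Assumption: identity = matching bases / overlap length.
        identity = 100.0 * (length - subs) / length
        rows.append((kind, length, subs, identity))
    return rows

# Three detail lines copied from the error output quoted in the post.
sample = """\
type=FORWARD, len=567, subs=299, edits=0 (175032, length=12932, start1=12365, stop1=12931) (229892, length=1324, start2=0, stop2=566)
type=FORWARDRC, len=912, subs=77, edits=0 (188, length=1866, start1=954, stop1=1865) (211768, length=1176, start2=1175, stop2=264)
type=FORWARD, len=210, subs=15, edits=0 (66196, length=2881, start1=2671, stop1=2880) (67503, length=1435, start2=0, stop2=209)
"""

for kind, length, subs, identity in overlap_identities(sample):
    print(f"{kind:10s} len={length:5d} subs={subs:4d} identity={identity:5.1f}%")
```

Under that definition the three sampled overlaps come out at about 47.3%, 91.6% and 92.9%, all below the requested minidentity=95. What `subs` actually counts at the moment the assertion fires is an assumption here, so treat the numbers as a rough guide only.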

Has any of you encountered these errors while running Dedupe.sh? Any help would be fantastic!

Assembly software error genome bbtools • 3.4k views

7.3 years ago by pedroivo000 ▴ 110

1

Can you post the exact command you are using?

Looking at the thread numbers, are you spreading this job across a cluster? That is not a good idea. Can you explicitly run it constrained to a single node (by specifying the correct number of threads with the threads=n option, and exclusive use of the node if you are using a job scheduler)?

7.3 years ago by GenoMax 141k

0

I am using only one node of a cluster we have here at Virginia Tech, as far as I am aware. Based on my limited knowledge of cluster architecture, the one I'm using has 134 nodes, but I am logged into one of the 8 interactive nodes. Each interactive node has 24 cores and the listed CPU is 2 x E5-2680v3 2.5GHz (Haswell). I don't think I am using a job scheduler to run this command; I am just running it normally from the command line.

Here is the exact command I used to run it (I'll also edit the post with the command):

$ ~/software/bbmap/dedupe.sh in=../../a6_best_assembly_v1.fasta out=a6_best_assembly_v1_DR.fasta outd=a6_best_assembly_v1_duplicates.fasta pattern=a6_best_assembly_v1_cluster% dot=a6_best_assembly_v1_overlapgraph.dot arc am ac fo c mcs=2 pc=t minidentity=95 mo=100 pto=t ngn=f sq=f ple=t overwrite=t

I'll run the tool again setting the number of threads to 1. Is that what you are suggesting?

7.3 years ago by pedroivo000 ▴ 110

1

Looks like you are running this job on the login node, since I don't see any commands related to the job scheduler. That is not a good practice.

Do you know what job scheduler your cluster uses?

There should be a maximum of 48 threads possible per physical node with multi-threading (if you have 24 cores per node). I am not sure where the thread numbers > 48 in your errors are coming from. Perhaps the head node has CPUs with more than 48 cores.

I would say add threads=24 and see what happens (in addition to things @Brian has asked below).

7.3 years ago by GenoMax 141k

0

Running it with threads=24 produces the same errors as the original run, and with threads=1 it runs until 3 of the errors from the first block appear and then it freezes. Even with the number of threads set to one, errors like "Exception in thread "Thread-64" java.lang.AssertionError:" still appear.

Here is some info on the cluster I am using, if it helps: www.arc.vt.edu/computing/newriver/ I know the basics of how to use servers, but since they gave me access to the interactive node and I've never had to use the job scheduler, I figured I didn't need it.

7.3 years ago by pedroivo000 ▴ 110

score 2 · Accepted Answer · 2017-01-12

2

7.3 years ago

Brian Bushnell 20k

Thanks for the detailed error report. I have a couple of ideas about this, but it might be a little difficult to test. You're getting different (and incorrect) output each time because it's crashing; it yields deterministic output when there are no crashes.

First - can you try adding the flag "-da" and see what happens?

Second - it's likely that "minidentity=95" is the cause of the instability. Can you try running without that flag and see what happens? If that turns out to be the cause, there may be some ways to work around it.

7.3 years ago by Brian Bushnell 20k

0

Running the command without minidentity=95 works and generates a large number of clusters. Here's the output:

Initial:
Memory: max=68854m, free=68136m, used=718m

Found 0 duplicates.
Finished exact matches.    Time: 0.206 seconds.
Memory: max=68854m, free=58795m, used=10059m

Found 0 contained sequences.
Finished containment.      Time: 0.178 seconds.
Memory: max=68854m, free=63083m, used=5771m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.003 seconds.
Memory: max=68854m, free=63083m, used=5771m

Found 1798 overlaps.
Finished finding overlaps. Time: 0.141 seconds.
Memory: max=68854m, free=62392m, used=6462m

Overlaps:       3596,   length: 5608490
Counted overlaps.          Time: 0.004 seconds.
Memory: max=68854m, free=62392m, used=6462m

Clusters:         6589 (907 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 5682              5682              16343670
2                 496               992               3097661
3-4               291               980               3625286
5-8               115               656               3164939
9-16              4                 37                174045
17-32             1                 21                32650

Largest:          21
Finished making clusters.  Time: 0.013 seconds.
Memory: max=68854m, free=62392m, used=6462m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.001 seconds.
Memory: max=68854m, free=62392m, used=6462m

Found 22 multijoins (9796 bases).
Experienced 0 multijoin removal failures.
Flipped 901 reads and 1243 overlaps.
Found 1 clusters (1 overlaps) with contradictory orientation cycles.
Found 5 clusters (7 overlaps) with remaining cycles.

After processing clusters:
Clusters:         6589 (907 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 5682              5682              16343670
2                 907               2686              10094581

Largest:          21
Finished processing.       Time: 0.027 seconds.
Memory: max=68854m, free=61729m, used=7125m

Input:                      8368 reads      26438251 bases.
Duplicates:                 0 reads (0.00%)     0 bases (0.00%)         0 collisions.
Containments:               0 reads (0.00%)     0 bases (0.00%)     121936 collisions.
Overlaps:                   1798 reads (21.49%)     2804245 bases (10.61%)      110403 collisions.
Result:                     8368 reads (100.00%)    26438251 bases (100.00%)

Running the same command as in the original post with the -da flag also works, and it also generates a large number of clusters:

Clusters:         4304 (853 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 3451              3451              7506775
2                 467               934               2092638
3-4               272               920               2556773
5-8               107               607               1612588
9-16              4                 36                70874
17-32             2                 50                226356
2049-4096         1                 2370              12372247

Largest:          2370
Finished making clusters.  Time: 0.014 seconds.
Memory: max=68854m, free=62392m, used=6462m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.001 seconds.
Memory: max=68854m, free=62392m, used=6462m

Found 32 multijoins (11768 bases).
Experienced 0 multijoin removal failures.
Flipped 1911 reads and 3621 overlaps.
Found 2 clusters (36 overlaps) with contradictory orientation cycles.
Found 12 clusters (15 overlaps) with remaining cycles.

After processing clusters:
Clusters:         4304 (853 of at least size 2)

Size Range        Clusters          Reads             Bases
1                 3451              3451              7506775
2                 853               4917              18931476

Largest:          2370
Finished processing.       Time: 0.071 seconds.
Memory: max=68854m, free=61729m, used=7125m

Input:                      8368 reads      26438251 bases.
Duplicates:                 0 reads (0.00%)     0 bases (0.00%)         0 collisions.
Containments:               0 reads (0.00%)     0 bases (0.00%)     121936 collisions.
Overlaps:                   4352 reads (52.01%)     5796362 bases (21.92%)      48322 collisions.
Result:                     8368 reads (100.00%)    26438251 bases (100.00%)

What is the -da flag doing?

7.3 years ago by pedroivo000 ▴ 110

1

That tells the program to disable assertions. I put some assertions in the code to make sure everything was as I expected. In this case, it was finding an overlap between two contigs just fine. But the assertion told it to try reverse-complementing both contigs and then check that it still found the same overlap, just to make sure everything is perfect. It didn't find the overlap anymore, so the assertion failed, which made it crash, indicating exactly which line the problem was found on along with some additional information. That helps me debug it in the future (well, as long as I have the full sequence printed out in the error messages; otherwise I can't replicate it).
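The kind of self-check described here can be sketched in a few lines of Python. This is not Dedupe's code: `best_overlap` is a hypothetical brute-force suffix/prefix search with a fixed substitution budget, used only to show the invariant that reverse-complementing both sequences (which swaps their roles) should give back the same overlap.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]

def best_overlap(a, b, min_len, max_subs):
    """Longest suffix of `a` matching a prefix of `b` with at most
    `max_subs` substitutions. Returns (length, subs) or None."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        subs = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if subs <= max_subs:
            return length, subs
    return None

def checked_overlap(a, b, min_len, max_subs):
    """Find the overlap, then assert the same overlap is found after
    reverse-complementing both sequences (the roles of a and b swap)."""
    forward = best_overlap(a, b, min_len, max_subs)
    flipped = best_overlap(revcomp(b), revcomp(a), min_len, max_subs)
    assert forward == flipped, f"asymmetric overlap: {forward} vs {flipped}"
    return forward

a = "ACGTACGTTTGACCA"
b = "TTGACCAGGGTACC"
print(checked_overlap(a, b, min_len=4, max_subs=0))  # prints (7, 0)
```

An exhaustive search with a fixed budget like this one is symmetric by construction, so the assertion never fires. Running the script with `python -O` skips the `assert` entirely, which is the Python analogue of Java's `-da`; the `-ea` visible in the `java` command line quoted in the question is the Java flag that enables assertions.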

Normally I run with a fixed number of substitutions or edits allowed, like "e=26" to allow up to 26 edits, rather than using the minidentity flag, which can be a little unstable. So I guessed that might be causing the problem. I'll have to look into it and see if I can figure out what the issue is... do you mind posting the full length sequences from one or two of the error messages so I can try to replicate it? In the meantime, you can use the "-da" flag (though in that case I can't ensure the output is perfectly correct) or use a combination of "s=" or "e=" instead of the minidentity flag.
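One practical difference between the two styles of threshold is that a fixed budget such as e=26 is the same for every overlap, while a percentage scales with overlap length. A rough Python sketch of that arithmetic follows; `allowed_subs_by_identity` is an illustrative helper, and rounding down is an assumption rather than documented Dedupe behavior.

```python
def allowed_subs_by_identity(overlap_len, min_identity_pct):
    """Mismatches permitted in an overlap under a percentage threshold.
    Integer arithmetic in tenths of a percent avoids float rounding;
    rounding down is an assumption, not documented Dedupe behavior."""
    mismatch_permille = 1000 - round(min_identity_pct * 10)
    return overlap_len * mismatch_permille // 1000

FIXED_EDITS = 26  # the "e=26" example mentioned above

# mo=100 from the original command, plus overlap lengths seen in the error log.
for overlap_len in (100, 210, 567, 912):
    by_pct = allowed_subs_by_identity(overlap_len, 95)
    print(f"overlap={overlap_len:4d}  minidentity=95 allows {by_pct:3d}  e=26 allows {FIXED_EDITS}")
```

So at the minimum overlap of 100 a 95% threshold tolerates far fewer mismatches than e=26 would, and at longer overlaps it tolerates more.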

Thanks, Brian

7.3 years ago by Brian Bushnell 20k

2

Is the assertion failing because the threads doing the jobs are running on physically separate nodes and losing communication? Any idea why thread numbers > 48 appear in error logs when there are 24 cores available per node?

pedroivo000 : I had a look at the link for the cluster config you had posted. It seems that there are 8 interactive nodes. I am not sure if they are treated as a pool or if you are actually logging into just one of them when you ask for an interactive queue. Your cluster uses the PBS job scheduler, and there are examples of how to use it. Requesting exclusive use of one node to constrain your job to a single node may be what is needed. It may be best to chat with your local server support to see how you can do that.

7.3 years ago by GenoMax 141k

0

Oh, that's an interesting possibility. I'm not familiar with PBS, so I'll see if I can replicate it when I run it on a shared-memory single node.

7.3 years ago by Brian Bushnell 20k

0

All right, thank you very much genomax2 ! I'll contact them to see if that is the case. I'll have to wait a few days, though, as the help desk has reduced staff until the beginning of classes.

7.3 years ago by pedroivo000 ▴ 110

0

Hi Brian, thank you very much for your replies :) I can't post the sequences in a comment because they exceed the maximum number of characters allowed, but here is the complete STDERR output from the run: https://drive.google.com/a/vt.edu/file/d/0B5JOeqpsqn8UZ1l5TnpVLWNpU28/view?usp=sharing

Regarding the s=, e= and mid= flags, the values you can set for them are highly dependent on the input data quality, right? The assemblies were made from MiSeq 2x250bp paired-end reads using pretty much all standard settings for the assemblers I listed in the post. The problem with my initial dataset was that a linear amplification step was done prior to library construction in order to increase the amount of initial DNA. That introduced PCR bias in the library and a lot of duplicated reads, which were removed before assembly. I also removed reads that mapped to bacterial contaminant genomes. I am not sure how these pre-assembly steps would affect sequence quality during sequencing and assembly. I know Illumina reads have a low error probability and, if I am not mistaken, an A-T substitution error bias. Is there any recommendation for the values of s= and e= in this case?

Sorry if this is a long question to ask in a comment, I can create a new question if needed.

7.3 years ago by pedroivo000 ▴ 110

1

I was able to download the file, and I'm about to go home, but I'll take a look at it tomorrow.

As for the "s=" etc. flags - they are not really related to the quality of the reads, but to that of the assembly. Since assemblies are consensus sequences, they should be much higher quality than the reads. So for running Dedupe on assemblies, depending on the assembler (and ploidy of the organism), I'd expect an error rate on the order of 0.1% for a haploid. For a diploid it could be much higher though. I'd suggest trying a longer minoverlap than 100 (perhaps 500) and setting, perhaps, "s=10 e=5" or "minidentity=99.5" with the -da flag and see what seems to give the best results (the assertion errors, if they are a bug, are likely very minor). But it really depends on the nature of your organism, and the quality of your assemblies. You might try shredding one and mapping it to another to get an idea of the average identity between assemblies; I would hope for it to be in excess of 99.8%.
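One way to connect the 0.1% per-assembly error rate to the hoped-for 99.8% identity between assemblies is a back-of-the-envelope calculation. It assumes the errors in the two assemblies are independent, so a position agrees only when neither assembly is wrong there. That independence assumption and the helper names below are illustrative and not taken from BBTools.

```python
def expected_pairwise_identity(error_rate_a, error_rate_b):
    """Approximate identity between two assemblies of the same genome,
    assuming independent errors (coinciding identical errors are ignored)."""
    return (1.0 - error_rate_a) * (1.0 - error_rate_b)

def expected_mismatches(overlap_len, identity):
    """Expected number of mismatching positions in an overlap of this length."""
    return overlap_len * (1.0 - identity)

# 0.1% error in each of the two assemblies being compared.
identity = expected_pairwise_identity(0.001, 0.001)
print(f"expected identity between assemblies: {identity * 100:.2f}%")

for min_overlap in (100, 500):
    mismatches = expected_mismatches(min_overlap, identity)
    print(f"overlap={min_overlap}: about {mismatches:.2f} expected mismatches")
```

Under these assumptions a 500 bp overlap would carry about one mismatch on average, so a budget like s=10 leaves plenty of headroom, while a diploid or a lower-quality assembly would eat into it quickly.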

There are also a few other pre-assembly techniques that might be helpful, if you look at /bbmap/docs/guides/PreprocessingGuide.txt. That's not really a step-by-step best-practices guide for assembly (I should write one), but you should consider adapter-trimming, read-merging, and error-correction if you have not already done so.

7.3 years ago by Brian Bushnell 20k

1

I'll try your suggestions to see what happens :) Thanks for the input! I really love BBTools, by the way.

7.3 years ago by pedroivo000 ▴ 110