Question

BBMap Clumpify: Exception in thread "Thread -#"

4.7 years ago

aman.akash2008 • 0

I am trying to use BBMap Clumpify to remove duplicates from WGS Illumina PE reads, but it is throwing an error:

    Exception in thread "Thread-6" Exception in thread "Thread-5" Exception in thread "Thread-5" Exception in thread 
   "Thread-6" java.lang.AssertionError: SRR9845570.201 D00656:415:HYN72BCX2:1:1108:1112:3886 length=151
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
at clump.KmerComparator.hash_inner(KmerComparator.java:79)
at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)
  java.lang.AssertionError: SRR9845570.1 D00656:415:HYN72BCX2:1:1108:1385:2052 length=151
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
    java.lang.AssertionError: SRR9845571.1 D00656:382:HT5CFBCX2:1:1106:1366:2160 length=101 at 
   clump.KmerComparator.hash_inner(KmerComparator.java:79)

at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
at clump.KmerComparator.hash_inner(KmerComparator.java:79)
at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)

and it goes on.

The code I am using is:

    for ((i=0;i<$TotalSamples;i++))
    do
         printf "\n Removing duplicate reads in sample ${fileIA1[$i]} and sample ${fileIA2[$i]}  \t"
         clumpify.sh in1=${fileIA1[$i]} in2=${fileIA2[$i]} out1=deduped_"${fileIA1[$i]}".fq out2=deduped_"${fileIA2[$i]}".fq
         dedupe=t dupedist=2500 subs=5 -Xmx50g &
    done

Is this a common issue with Clumpify, or is there a problem with my code? The system I am running the script on has 56 CPU cores and 256 GB of RAM.

genome Assembly alignment duplicate • 1.4k views

ADD COMMENT • link updated 4.7 years ago by Mensur Dlakic ★ 30k • written 4.7 years ago by aman.akash2008 • 0

In addition to what @Mensur said below, adding subs=5 is likely to need more than 50G of memory, depending on the size of your input dataset. You will find out soon enough if 50G is not enough.

I suggest you remove the ampersand (so the jobs are not sent to the background) and run them serially with 200G of RAM per job. They are likely to complete more quickly this way. You should also add threads=18 to use multiple threads per job; experiment with this number to see which combination of threads and memory works best.
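One way to run that experiment is to time a single sample pair at a few thread counts before processing everything. The sketch below is only an illustration: the input file names, the output prefixes, and the thread counts other than 18 are made up, and `DRY_RUN` is a helper variable introduced here so the commands are printed rather than executed by default.

```bash
#!/usr/bin/env bash
# Hypothetical test pair; pick one representative sample of your own.
in1=sampleA_R1.fastq
in2=sampleA_R2.fastq

# DRY_RUN=1 (the default in this sketch) only prints the command.
# Set DRY_RUN=0 to actually run clumpify.sh and report the elapsed time.
DRY_RUN=${DRY_RUN:-1}

time_clumpify() {
    local threads=$1 mem=$2
    # Same options as the original command, plus threads= and a larger -Xmx.
    local cmd=(clumpify.sh in1="$in1" in2="$in2"
               out1="t${threads}_R1.fq" out2="t${threads}_R2.fq"
               dedupe=t dupedist=2500 subs=5
               threads="$threads" "-Xmx${mem}")
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "${cmd[*]}"
        return
    fi
    local start=$SECONDS
    "${cmd[@]}"
    printf 'threads=%s mem=%s took %ss\n' "$threads" "$mem" "$((SECONDS - start))"
}

# Thread counts to compare; 18 is the suggestion above, the others are arbitrary.
for t in 8 18 36; do
    time_clumpify "$t" 200g
done
```

Each run is in the foreground, so the timings are not distorted by other Clumpify jobs competing for memory.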

ADD REPLY • link 4.7 years ago by GenoMax 154k

score 0 · Answer 1 · 2021-03-14

You may want to remove the ampersand (&) at the end of the clumpify.sh line. It sends each process into the background with every one claiming 50 GB, and if you have many files to process that will choke even a computer with 256 GB of memory. Instead, try processing one file group at a time.

Just in case: the two lines above that start with clumpify.sh and dedupe=t should be on a single line.
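Put together, the loop might look something like the sketch below. The sample file names are placeholders for the real fileIA1/fileIA2 arrays, and `DRY_RUN` is a helper variable added for this example (it is not a Clumpify option) so the commands can be inspected before anything runs.

```bash
#!/usr/bin/env bash
# Placeholder file names; substitute the real fileIA1/fileIA2 arrays.
fileIA1=(sampleA_R1.fastq sampleB_R1.fastq)
fileIA2=(sampleA_R2.fastq sampleB_R2.fastq)
TotalSamples=${#fileIA1[@]}

# DRY_RUN=1 (the default in this sketch) prints each command; DRY_RUN=0 runs it.
DRY_RUN=${DRY_RUN:-1}

run_clumpify() {
    # All options belong to one command, and there is no trailing '&'.
    local cmd=(clumpify.sh in1="$1" in2="$2"
               out1="deduped_$1.fq" out2="deduped_$2.fq"
               dedupe=t dupedist=2500 subs=5 -Xmx50g)
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

for ((i = 0; i < TotalSamples; i++)); do
    printf 'Removing duplicate reads in %s and %s\n' "${fileIA1[$i]}" "${fileIA2[$i]}" >&2
    # Runs in the foreground, so the next pair starts only after this one ends.
    run_clumpify "${fileIA1[$i]}" "${fileIA2[$i]}"
done
```

Because only one job runs at a time, the total memory claimed never exceeds the single -Xmx value.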