I am trying to use BBMap's clumpify to remove duplicates from WGS Illumina PE reads, but it's throwing an error:
Exception in thread "Thread-6" Exception in thread "Thread-5" Exception in thread "Thread-5" Exception in thread
"Thread-6" java.lang.AssertionError: SRR9845570.201 D00656:415:HYN72BCX2:1:1108:1112:3886 length=151
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
at clump.KmerComparator.hash_inner(KmerComparator.java:79)
at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)
java.lang.AssertionError: SRR9845570.1 D00656:415:HYN72BCX2:1:1108:1385:2052 length=151
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
java.lang.AssertionError: SRR9845571.1 D00656:382:HT5CFBCX2:1:1106:1366:2160 length=101 at
clump.KmerComparator.hash_inner(KmerComparator.java:79)
at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
at clump.KmerComparator.hash_inner(KmerComparator.java:79)
at clump.KmerComparator.hash(KmerComparator.java:70)
at clump.KmerComparator.hash(KmerComparator.java:66)
at clump.KmerSort$FetchThread1.run(KmerSort.java:394)
and it goes on.
The code I am using is:
for ((i=0;i<$TotalSamples;i++))
do
printf "\n Removing duplicate reads in sample ${fileIA1[$i]} and sample ${fileIA2[$i]} \t"
clumpify.sh in1="${fileIA1[$i]}" in2="${fileIA2[$i]}" out1=deduped_"${fileIA1[$i]}".fq out2=deduped_"${fileIA2[$i]}".fq \
dedupe=t dupedist=2500 subs=5 -Xmx50g &
done
Is this a common issue with clumpify, or is there a problem with my code? The system I am running the script on has 56 CPU cores and 256 GB of RAM.
In addition to what @Mensur said below, adding subs=5 is likely to need more than 50G of memory, depending on the size of your input dataset. You will find out soon enough if 50G is not enough. Note that with the ampersand in the loop, every sample is started at the same time, each asking for its own 50G. I suggest you remove the ampersand (so the jobs are not sent to the background) and run them serially, giving each job all 200G of RAM. They are likely to complete quicker this way. You should also add threads=18 to use multiple threads per job (experiment with this number to find the optimum combination of threads and memory).
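As a rough sketch of the serial version suggested above (not a tested recipe): the sample file names, the DRY_RUN switch and the dedupe_pair helper are placeholders invented for illustration, and the 200g / threads=18 values are only starting points to tune. By default it just prints each command, so you can check it before running anything.

```shell
#!/usr/bin/env bash
# Hypothetical sample lists; replace with your own paired FASTQ files.
fileIA1=(sampleA_R1.fastq sampleB_R1.fastq)
fileIA2=(sampleA_R2.fastq sampleB_R2.fastq)
TotalSamples=${#fileIA1[@]}

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: build the clumpify command for one read pair,
# then print it (dry run) or execute it in the foreground.
dedupe_pair() {
  local r1=$1 r2=$2
  local cmd=(clumpify.sh "in1=$r1" "in2=$r2"
             "out1=deduped_$r1.fq" "out2=deduped_$r2.fq"
             dedupe=t dupedist=2500 subs=5 threads=18 -Xmx200g)
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"   # no trailing '&', so the loop waits for this job to finish
  fi
}

for ((i = 0; i < TotalSamples; i++)); do
  # Progress message goes to stderr so stdout holds only the commands.
  printf 'Removing duplicate reads in %s and %s\n' \
    "${fileIA1[$i]}" "${fileIA2[$i]}" >&2
  dedupe_pair "${fileIA1[$i]}" "${fileIA2[$i]}"
done
```

Because each call runs in the foreground, only one clumpify job holds memory at a time, which is why a single large -Xmx value is safe here. If you later want two or three jobs in parallel, divide the memory and threads between them accordingly.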