Working with dedupe from the BBMap suite (highly appreciated suite!) to estimate/determine exact duplicates in Illumina PE150 metagenomic data, looking for potential over-amplification bias from PCR during library prep.
Two things I hope to get some clues on:
1) Can dedupe write its console stats to a file? It seems to write those stats to STDERR and not to STDOUT, so simply piping the output to a log file does not work, and neither does redirecting stderr to stdout with 2>&1. I temporarily solved it by running the tool in a logged GNU screen session, but that is silly. I just want the total reads and the number of exact duplicates found in a stats file.
2) Why does a small set of two files of 10 GB each run out of memory on a 64 GB node? I gave Java all the memory I could find. On a 128 GB node it ran past 95 GB, at which point I had to terminate it. How should I run those? A simple GNU sort | uniq on just the sequences consumed less than 15 GB of memory. I do want to run this de novo, without reference mapping.
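For reference, the sort | uniq baseline I mean is roughly this (a rough sketch; the file name and sort buffer size are placeholders):

zcat reads_R1.fq.gz | awk 'NR%4==2' | sort -S 12G | uniq -c | awk '{t+=$1} $1>1{d+=$1-1} END{printf "%d reads, %d exact duplicates\n", t, d}'

The stats do indeed go to STDERR, so redirect that stream directly to a file rather than merging it into stdout; something along these lines (input/output names are illustrative):

dedupe.sh in=reads.fq.gz out=deduped.fq.gz 2>dedupe_stats.txt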
That should write the stats to a file for dedupe.sh. As for memory: dedupe.sh holds all the read data in memory, which is likely why two 10 GB files blow past 64 GB once expanded into Java objects. Look at clumpify.sh as an option instead, since it uses temp files to hold data (Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.).
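An exact-duplicate run could look something like this (a sketch, not tested on your data; file names are placeholders, subs=0 restricts matching to exact duplicates, and groups= forces splitting the data into on-disk temp groups to bound memory):

clumpify.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=clumped_R1.fq.gz out2=clumped_R2.fq.gz dedupe subs=0 groups=16 2>clumpify_stats.txt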
It lets you slice and dice the duplicates in many ways AND it does not require a reference/alignments. You may still run out of memory depending on the size of the data.

clumpify might indeed do what I want. I did play around with the clusterstats file but didn't get what I expected. I will reinvestigate it.
If you only need the statistics, then do not provide out= option(s) to clumpify.sh.
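In other words, a stats-only run would be something like this (again a sketch; file names are illustrative):

clumpify.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz dedupe subs=0 2>dupe_stats.txt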