I am working with dedupe from the BBMap suite (a highly appreciated suite!) to determine the number of exact duplicates in Illumina PE150 metagenomic data. I suspect overamplification bias from PCR during library prep.
Two things I hope to get some clues on:
1) Can dedupe write its console output to a stats file? It seems to write those stats to stderr rather than stdout, so simply piping the output to a log file does not work. Redirecting stderr to stdout with 2>&1 does not work either. I temporarily worked around it by running the tool in a logged GNU screen session, but that is silly. I just want the total number of reads and the number of exact duplicates found, written to a stats file.
2) Why does a small set of two files of 10Gb each run out of memory on a 64Gb node? I gave Java all the memory I could find. On a 128Gb node it climbed past 95Gb, at which point I had to terminate it. How can I run these? A simple GNU sort | uniq on just the sequences consumed less than 15Gb of memory. I do want to run this de novo, without reference mapping.
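
To make the request concrete, here is a minimal Python sketch of the two numbers wanted in the stats file (total reads and exact duplicates). It is not how dedupe works internally, and it is not the sort | uniq command mentioned above. It assumes plain 4-line FASTQ records, treats each read on its own (no pairing, no reverse-complement matching), and stores a 16-byte hash per distinct sequence instead of the sequence itself, which is only an assumption about how to keep memory down. The function names and the stats layout are hypothetical.

```python
import hashlib
import io


def read_sequences(handle):
    """Yield the sequence line of each record, assuming 4-line FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def count_exact_duplicates(sequences):
    """Return (total, duplicates), where duplicates = total - distinct."""
    seen = set()
    total = 0
    for seq in sequences:
        total += 1
        # Keep a fixed-size digest rather than the full read (assumption:
        # hash collisions are rare enough to ignore for an estimate).
        digest = hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest()
        seen.add(digest)
    return total, total - len(seen)


def write_stats(path, total, duplicates):
    """Write the two counts to a tab-separated stats file (hypothetical layout)."""
    with open(path, "w") as out:
        out.write(f"total_reads\t{total}\n")
        out.write(f"exact_duplicates\t{duplicates}\n")


if __name__ == "__main__":
    # Tiny in-memory FASTQ: three reads, two of them identical.
    demo = io.StringIO(
        "@r1\nACGT\n+\nIIII\n"
        "@r2\nACGT\n+\nIIII\n"
        "@r3\nTTGA\n+\nIIII\n"
    )
    total, duplicates = count_exact_duplicates(read_sequences(demo))
    print(total, duplicates)
```

For real data, the `io.StringIO` demo would be replaced by an open file handle (or `gzip.open(path, "rt")` for compressed input), and `write_stats` would be called with the resulting counts.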