21 months ago by
University Park, USA
I will say that deduplication is a far more complex concept than most end users initially assume. Even interpreting a deduplication plot is far from trivial; I had to give it two tries myself.
In the early days of sequencing, coverage was low, the sequencing process was error-prone, tools were unable to cope with identical reads, and just about all duplicates were artificial. Today coverage is much higher and natural duplicates are far more prevalent. SNP calling tools can recognize and deal with artificial duplicates from the data itself. Thus the need to deduplicate reads is less critical.
That being said, if you can write a fast and efficient read deduplicator, there is most certainly room for that, especially if it integrates with an existing toolset (fastp). The very fact that a new FASTQ processor can be successful after all these years demonstrates that there is always room for a well-written tool.
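To make the idea concrete, here is a minimal sketch of what the simplest kind of read deduplicator might look like: it keeps the first read seen for each exact sequence and drops the rest. This is only an illustration and not how fastp or any other existing tool works; the function names `parse_fastq` and `dedup_exact` are made up for this example, and it assumes well-formed four-line FASTQ records and single-end reads.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); a hypothetical record type for this sketch
Record = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line carries no information we need
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def dedup_exact(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the first record seen for each exact sequence."""
    seen = set()
    for header, seq, qual in records:
        # Store a 16-byte digest instead of the full sequence to save memory.
        key = hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest()
        if key not in seen:
            seen.add(key)
            yield header, seq, qual


if __name__ == "__main__":
    example = [
        "@r1\n", "ACGT\n", "+\n", "IIII\n",
        "@r2\n", "ACGT\n", "+\n", "IIII\n",
        "@r3\n", "TTTT\n", "+\n", "IIII\n",
    ]
    for header, seq, qual in dedup_exact(parse_fastq(example)):
        print(header, seq)
```

Note that a sequence-only approach like this cannot tell natural duplicates from artificial ones, which is exactly why it is risky on high-coverage data. A real tool would also have to handle paired-end reads, sequencing errors, and memory use on very large files.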
I will also concur with genomax that a read data simulator would help a lot of people. Today the field is very fragmented: one needs a different tool for each target, and the usage is clumsy.