Forum: Do you need a deduplication tool for FASTQ data in fastp?
21 months ago, chen (2.1k) wrote:

Hi, I am the author of fastp, a tool to provide ultra-fast all-in-one FASTQ preprocessing functions.

This tool has received 500+ stars on GitHub and has been cited 40+ times since its paper was published in Bioinformatics about 8 months ago.

Now I am considering adding a deduplication function to it. Implementing it may require some effort, so I would like to ask users here whether they need this feature.

Your replies will be much appreciated. I will continue to improve this tool.

deduplication fastp forum • 1.6k views
modified 13 months ago by manekineko (140) • written 21 months ago by chen (2.1k)

chen : Can I make an unrelated suggestion?

If you are looking for a new programming challenge, consider creating a data simulator that can generate data with UMIs. Think about creating data for single cells, cell types, 10X, etc. AFAIK there is nothing available that can do this now.

I concur with @Devon's point below, but the nature of the data necessitates extreme amounts of RAM (I have used over a TB for NovaSeq data with clumpify).

modified 21 months ago • written 21 months ago by GenoMax (96k)

Thanks, I will consider your suggestion.

For deduplication, I think I can keep RAM usage below 16 GB even when processing 1 Tbp of Illumina PE data.
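
One generic way to keep deduplication memory small is to remember a short fixed-size digest of each read instead of the read itself. The following is a minimal sketch of that idea only, not fastp's implementation; the function name `dedup_by_hash` and the 8-byte digest size are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def dedup_by_hash(reads: Iterable[Read], digest_bytes: int = 8) -> Iterator[Read]:
    """Yield the first occurrence of each sequence.

    Only a short digest per distinct sequence is kept in memory, so memory
    grows with the number of distinct reads, not with their length.
    A digest collision would drop a non-duplicate read (hypothetical
    trade-off accepted in this sketch).
    """
    seen = set()
    for header, seq, qual in reads:
        # blake2b supports variable digest sizes from 1 to 64 bytes
        key = hashlib.blake2b(seq.encode("ascii"), digest_size=digest_bytes).digest()
        if key in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(key)
        yield header, seq, qual


example = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "ACGT", "IIII"),  # same sequence as r1
    ("@r3", "TTTT", "IIII"),
]
kept = list(dedup_by_hash(example))
```

For paired-end data the key could be computed over both mates together. A Python set carries considerable per-entry overhead, so a tool aiming at a fixed memory budget would probably need a more compact structure; this sketch only shows the principle.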

written 21 months ago by chen (2.1k)

After 7 months, how is the landscape? Is there a tool for extracting UMIs and deduplicating at the FASTQ level? My workflow needs deduplication before mapping, that is, before any BAM file exists.

written 13 months ago by manekineko (140)

They've recently added a gencore repository, which might be able to do that. I haven't used this yet, I just remember merging in the bioconda recipes recently.

Update: I guess this takes BAM files, so it's not relevant.

modified 13 months ago • written 13 months ago by Devon Ryan (98k)

manekineko: If you need de-duping before mapping, your best bet is still Clumpify from the BBMap suite (see the post "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files").

written 13 months ago by GenoMax (96k)

Hi, can you tell me whether fastp actually removes duplicates or just counts them?


./fastp --version 
fastp 0.21.0
modified 7 weeks ago by Ram (32k) • written 7 weeks ago by leonardo.rippel (20)
21 months ago, Istvan Albert ♦♦ (86k), University Park, USA, wrote:

I will say that de-duplication is a far more complex concept than what people/end users initially assume. Even interpreting the meaning of a deduplication plot is far from trivial - I had to give it two tries myself.

In the early days of sequencing, coverages were low, the sequencing process was error-prone, tools were unable to cope with identical reads, and just about all duplicates were artificial. Today coverages are much higher and natural duplicates are far more prevalent. SNP calling tools can recognize and deal with artificial duplicates from the data itself. Thus the need to deduplicate reads is less critical.

That being said if you can write a fast and efficient read deduplicator, there is most certainly room for that. Especially if it would integrate with an existing toolset (fastp). The very fact that a new fastq processor can be successful after all these years demonstrates that there is always room for a well-written tool.

I also concur with genomax that a read data simulator would help a lot of people. Today the field is very fragmented: one needs a different tool for each target, and the usages are clumsy.

written 21 months ago by Istvan Albert ♦♦ (86k)

Thanks for your advice.

written 21 months ago by chen (2.1k)
21 months ago, Devon Ryan (98k), Freiburg, Germany, wrote:

If you're going to implement something along those lines, model it after clumpify from bbmap, which marks optical duplicates and lets the user set the distance between clusters used for calling duplicates. Marking optical duplicates is one of the few cases where duplicates should be marked directly on fastq files. As an aside, clumpify usually works very well and very quickly. In a few cases (usually when the rate of optical duplication is quite high) it uses hundreds of GB of RAM and eventually crashes. If you can come up with something with similar run time but lower worst-case memory requirements, that'd be awesome.
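
To make the idea of distance-based optical duplicate marking concrete, here is a minimal sketch. It is not how clumpify works internally; it assumes Illumina (Casava 1.8+) read headers of the form `@instrument:run:flowcell:lane:tile:x:y ...`, requires exact sequence identity, and uses hypothetical names (`parse_coords`, `mark_optical_duplicates`) and an arbitrary placeholder distance of 40.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def parse_coords(header: str) -> Tuple[str, int, int]:
    """Split an Illumina header into (flowcell:lane:tile, x, y)."""
    fields = header.lstrip("@").split()[0].split(":")
    # fields: instrument, run, flowcell, lane, tile, x, y
    location = ":".join(fields[2:5])
    return location, int(fields[5]), int(fields[6])


def mark_optical_duplicates(
    reads: Iterable[Tuple[str, str]], max_dist: float = 40.0
) -> List[bool]:
    """Return one flag per (header, sequence) read.

    True means an identical sequence was already kept on the same tile
    within max_dist of this read's cluster coordinates.
    max_dist is the user-modifiable distance (placeholder default).
    """
    kept = defaultdict(list)  # (location, sequence) -> [(x, y), ...]
    flags = []
    for header, seq in reads:
        location, x, y = parse_coords(header)
        neighbours = kept[(location, seq)]
        is_dup = any(
            (x - kx) ** 2 + (y - ky) ** 2 <= max_dist ** 2 for kx, ky in neighbours
        )
        flags.append(is_dup)
        if not is_dup:
            neighbours.append((x, y))
    return flags


example = [
    ("@M1:1:FC1:1:1101:1000:2000 1:N:0:ACGT", "ACGT"),
    ("@M1:1:FC1:1:1101:1010:2010 1:N:0:ACGT", "ACGT"),  # same tile, close by
    ("@M1:1:FC1:1:1101:5000:2000 1:N:0:ACGT", "ACGT"),  # same tile, far away
    ("@M1:1:FC1:1:1102:1000:2000 1:N:0:ACGT", "ACGT"),  # different tile
    ("@M1:1:FC1:1:1101:1001:2001 1:N:0:ACGT", "TTTT"),  # different sequence
]
flags = mark_optical_duplicates(example)
```

The sketch keeps every distinct (tile, sequence) pair in memory, which hints at why worst-case memory can become a problem on large runs. A suitable distance depends on the sequencing platform, and a real tool would likely also need to tolerate sequencing errors rather than require exact matches.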

written 21 months ago by Devon Ryan (98k)