Question

Picard MarkDuplicates help – how to find the number of duplicates removed


2.4 years ago

urvashi_s • 0

Hello,

It is my first time using Picard to remove duplicates, and here are some of the duplication metrics:

ESTIMATED_LIBRARY_SIZE: 24195090

PERCENT_DUPLICATION: 0.707214 (so ~70%)

READ_PAIR_DUPLICATES: 55689797 (~55.7M)

Histogram BIN1: 8692285 (so this is essentially the fragments present a single time)

In the metrics or the output file, how can I find the number of reads/fragments that have been removed?

Any help would be appreciated.

mark rna-seq picard duplicates • 972 views

updated 2.4 years ago by uli • 0 • written 2.4 years ago by urvashi_s • 0

score 0 · Answer 1 · 2021-12-11

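In the metrics, READ_PAIR_DUPLICATES (the ~55.7M figure) is the number of read pairs that were marked as duplicates, i.e. 2 × 55,689,797 = 111,379,594 reads. Reads that are unpaired or whose mate is unmapped are counted separately under UNPAIRED_READ_DUPLICATES, so the total number of duplicate reads is UNPAIRED_READ_DUPLICATES + 2 × READ_PAIR_DUPLICATES. You can find more information on the Picard outputs here: https://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics. Note that when run in default mode the program only flags duplicates in the output file and does not remove them, so to actually drop them you would need to add REMOVE_DUPLICATES=true.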
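If it helps, here is a minimal Python sketch of pulling that count out of the metrics file. It assumes the usual Picard layout (a `## METRICS CLASS` line, then a tab-separated header row, then one row per library, ended by a blank line). The function names and the example values are made up for illustration, and the example table is trimmed to a few columns; a real file has more.

```python
def read_duplication_metrics(lines):
    """Return one dict per library row of the DuplicationMetrics table."""
    rows = []
    header = None
    in_table = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("## METRICS CLASS"):
            in_table = True          # the header row follows this line
            continue
        if not in_table:
            continue                 # still in the comment preamble
        if not line.strip():
            break                    # a blank line ends the metrics table
        if header is None:
            header = line.split("\t")
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


def duplicate_read_count(row):
    """Duplicate reads = unpaired duplicates + 2 reads per duplicate pair."""
    return int(row["UNPAIRED_READ_DUPLICATES"]) + 2 * int(row["READ_PAIR_DUPLICATES"])


# Made-up, trimmed example (NOT the numbers from the question).
example_header = ["LIBRARY", "UNPAIRED_READS_EXAMINED", "READ_PAIRS_EXAMINED",
                  "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES",
                  "PERCENT_DUPLICATION"]
example_values = ["lib1", "100", "450", "50", "100", "0.25"]
example = [
    "## htsjdk.samtools.metrics.StringHeader",
    "# MarkDuplicates (command line elided)",
    "",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "\t".join(example_header),
    "\t".join(example_values),
    "",
    "## HISTOGRAM\tjava.lang.Double",
]

rows = read_duplication_metrics(example)
print(duplicate_read_count(rows[0]))  # prints 250 (50 + 2 * 100)
```

For a real run you would pass an open file handle instead of the list, e.g. `with open("metrics.txt") as fh: rows = read_duplication_metrics(fh)`, where `metrics.txt` stands in for whatever path you gave to METRICS_FILE. With REMOVE_DUPLICATES=true this should be roughly the number of reads dropped from the output; otherwise it is the number of reads flagged as duplicates.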

Particularly for RNA-seq, though, whether to remove PCR duplicates at all is still debated, and removing them can sometimes do more harm than good: https://www.nature.com/articles/srep25533.