Deduplication using UMI-tools
21 months ago
Ati ▴ 40

I have some RNA-seq data with a high duplication rate, but the reads carry UMIs (Unique Molecular Identifiers) that are 5 bp long. I used umi_tools dedup to remove duplicates. When I then checked the result with Picard MarkDuplicates, the duplication rate was still fairly high for some samples.

I would expect a low or even zero duplication rate after running UMI-tools. Is there an explanation?

Could the length of UMI be the reason?

21 months ago

Picard's duplication metrics should be completely ignored if you have UMIs. As normally run, it doesn't use UMIs and will therefore report inflated duplication rates: Picard calls PCR duplicates from the positions of the read ends alone, whereas umi_tools uses that information in addition to the UMI sequence. If you have run umi_tools dedup, then the actual duplication rate is 0, regardless of what Picard may report. The reads Picard still flags share mapping coordinates but carry different UMIs, so they come from distinct molecules rather than PCR copies.
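To make the difference concrete, here is a minimal sketch in Python of position-only versus position-plus-UMI duplicate calling. The `Read` record and both function names are made up for illustration and are not the Picard or UMI-tools data model. It also assumes exact UMI matching, whereas umi_tools dedup by default uses a network-based method that additionally merges UMIs likely to differ only by sequencing errors.

```python
from collections import namedtuple

# Hypothetical minimal read record, for illustration only.
Read = namedtuple("Read", ["chrom", "pos", "strand", "umi"])


def position_only_duplicates(reads):
    """Count reads flagged as duplicates when only mapping coordinates are compared."""
    seen = set()
    dups = 0
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key in seen:
            dups += 1
        else:
            seen.add(key)
    return dups


def umi_aware_dedup(reads):
    """Keep the first read seen for each (coordinates, UMI) combination.

    Assumption: exact UMI matching, no error correction.
    """
    kept = {}
    for r in reads:
        key = (r.chrom, r.pos, r.strand, r.umi)
        kept.setdefault(key, r)
    return list(kept.values())


reads = [
    Read("chr1", 100, "+", "ACGTA"),
    Read("chr1", 100, "+", "ACGTA"),  # same position, same UMI: PCR duplicate
    Read("chr1", 100, "+", "TTGCA"),  # same position, different UMI: distinct molecule
    Read("chr1", 250, "-", "ACGTA"),
]

deduped = umi_aware_dedup(reads)
print(len(deduped))                       # 3 reads survive UMI-aware deduplication
print(position_only_duplicates(deduped))  # 1 read is still flagged by position alone
```

After UMI-aware deduplication, a position-only check still flags one read, even though every remaining read is a distinct (position, UMI) combination. That is the same pattern as a non-zero Picard duplication rate on a umi_tools-deduplicated BAM.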

@Devon Ryan Thank you! Even if the UMI is short (5 bp)?

yes
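For a rough sense of what a 5 bp UMI implies, the sketch below counts the possible UMIs and estimates how likely it is that at least two distinct molecules mapping to the same position share a UMI. It assumes UMIs are drawn uniformly at random and ignores sequencing errors, which real libraries will not strictly satisfy; the function names are illustrative only. Note the direction of the effect: a short UMI makes it more likely that distinct molecules get collapsed together, so it would not by itself explain the duplicates Picard still reports.

```python
def umi_space(umi_length):
    """Number of possible UMI sequences of a given length over A, C, G, T."""
    return 4 ** umi_length


def prob_umi_collision(n_molecules, umi_length):
    """Probability that at least two of n molecules at one position share a UMI.

    Assumption: UMIs are uniformly random and error-free (birthday-problem model).
    """
    n_umis = umi_space(umi_length)
    if n_molecules > n_umis:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (n_umis - i) / n_umis
    return 1.0 - p_all_distinct


print(umi_space(5))  # 1024 possible 5 bp UMIs
for n in (2, 10, 50):
    print(n, round(prob_umi_collision(n, 5), 4))
```

At highly expressed positions with many molecules, collisions become likely under this model, which means some true molecules would be merged rather than extra duplicates being left behind.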

