Basically, duplicates are of two kinds:
- natural duplicates - identical DNA fragments that were genuinely present in the sample, produced by the biological system itself
- artificial duplicates - identical reads produced during library preparation (typically PCR amplification) or by the sequencing instrument itself
Of course, we'd want to keep the first kind of duplicate and remove the second. But rarely, if ever, can the two be told apart: both show up as identical reads mapping to identical coordinates. Hence the conundrum.
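To make the conundrum concrete, here is a minimal sketch of how duplicate detection typically operates: reads are grouped by mapping coordinates, and any group larger than one is a duplicate candidate. This is an illustration using the pysam library, not anyone's production method; the path `example.bam` is a placeholder, and real tools such as Picard MarkDuplicates also consider mate position and orientation. Note that nothing in the grouping key can tell a natural duplicate from an artificial one.

```python
from collections import defaultdict

import pysam


def duplicate_groups(bam_path):
    """Group mapped reads by (reference, start, strand).

    Groups with more than one read are duplicate candidates;
    whether they are natural or artificial is invisible here.
    """
    groups = defaultdict(list)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # Skip reads that carry no usable coordinate.
            if read.is_unmapped or read.is_secondary:
                continue
            key = (read.reference_name, read.reference_start, read.is_reverse)
            groups[key].append(read.query_name)
    return {k: v for k, v in groups.items() if len(v) > 1}


# "example.bam" is a hypothetical input file.
for coords, names in duplicate_groups("example.bam").items():
    print(coords, "->", len(names), "reads share these coordinates")
```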
While we are at it, an empirical observation of mine: data with a high rate of artificial duplication is often useless even after the duplicates are removed. High duplication rates tend to come along with many other problems, so it does not really matter what you do with the data - it remains useless.
In general, from what I understand, people tend to deduplicate their data when uniform coverage is expected across the genome and when the coverage at a given position has major implications for the results. For example, in SNP calling the number of reads supporting a variant is an essential factor in deciding whether to trust that variant, and we'd want to avoid counting artificial duplicates there.
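As a hedged sketch of why this matters for SNP calling: once duplicates have been flagged by a tool such as Picard MarkDuplicates or samtools markdup, the read depth supporting a site can be counted with and without them. The file name and coordinates below are placeholders, and the BAM is assumed to be coordinate-sorted and indexed (fetch requires an index).

```python
import pysam


def support_at(bam_path, chrom, pos):
    """Return (total, non-duplicate) read counts covering chrom:pos."""
    total = unique = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # fetch() needs a coordinate-sorted, indexed BAM.
        for read in bam.fetch(chrom, pos, pos + 1):
            if read.is_unmapped:
                continue
            total += 1
            # is_duplicate reflects the flag set by the marking tool.
            if not read.is_duplicate:
                unique += 1
    return total, unique


# Hypothetical file and position, for illustration only.
total, unique = support_at("marked.bam", "chr1", 123456)
print(f"{total} reads cover the site; {unique} remain after removing duplicates")
```

If the two counts differ wildly, the apparent support for a variant at that site was largely an artifact of duplication.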
In most other cases, and especially when the expected coverage varies wildly and there are legitimate reasons for the same fragment to occur very frequently (for example, a highly expressed short transcript in a transcriptome study), duplicate removal is not recommended.