Question: Removing PCR duplicates - fastq or BAM?
rbronste wrote (20 months ago):

Wondering about the pros and cons of removing duplicates from the raw fastq files versus from the BAM alignment? Thanks.

Tags: bam, alignment, fastq, duplicates

Less work if you dedupe up front: clumpify.sh from BBMap (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.") does this without needing alignments.
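Sequence-based deduplication of the kind Clumpify performs boils down to collapsing reads with identical sequence, with no alignment step involved. A minimal Python sketch of the idea (an illustration of the concept only, not Clumpify itself, which also clusters reads and can tolerate mismatches; the function name and sample records are invented):

```python
# Alignment-free duplicate removal: keep the first read seen for each
# exact sequence. Real tools also consider quality and allow mismatches.

def dedupe_fastq(records):
    """records: iterable of (name, sequence, quality) tuples."""
    seen = set()
    unique = []
    for name, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq, qual))
    return unique

reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGT", "IIIIHHII"),  # duplicate of r1 (same sequence)
    ("r3", "TTGGCCAA", "IIIIIIII"),
]
print([name for name, _, _ in dedupe_fastq(reads)])  # ['r1', 'r3']
```

Note that exact-sequence matching is why this is fast and alignment-free, and also why it misses duplicates whose copies carry sequencing errors.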

— genomax (20 months ago)

Though I personally prefer Clumpify for duplicate removal, mapping-based approaches can be more robust to reads with lots of errors (if you consider those duplicates). But in addition to the increased time, mapping-based deduplication has the disadvantage of a lossy conversion to sam/bam format, which typically chops off part of the original read header (everything after the first whitespace).

I think some mapping-based deduplication tools may not be robust to read pairs whose mates map to different chromosomes, or to pairs where only one read is mapped, and certainly not when neither read is mapped. I wrote a mapping-based deduplication program that handles duplicates in the first two scenarios, but as a result it uses a lot of memory. My recollection is that one of samtools or GATK handles duplicate pairs mapped to different chromosomes and the other doesn't. As for unmapped reads: some aligners will not map reads that contain a lot of adapter sequence, even if they came from the correct genome, so those short-insert reads would not be deduplicated based on mapping the raw reads.
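Whether inter-chromosomal duplicates are caught comes down to what key a tool uses to identify a pair. A hedged Python sketch (the function name and record layout are invented for illustration): keying on both mates' (reference, position, strand) catches pairs split across chromosomes, but it means remembering keys genome-wide rather than per-chromosome, which is where the memory cost mentioned above comes from.

```python
# Mapping-based duplicate marking keyed on BOTH mates' coordinates.
# Tools that only key on the current reference must buffer or ignore
# mates that land on a different chromosome.

def mark_pair_duplicates(pairs):
    """pairs: list of ((chrom1, pos1, strand1), (chrom2, pos2, strand2)).
    Returns a parallel list of booleans: True = marked as duplicate."""
    seen = set()
    flags = []
    for r1, r2 in pairs:
        key = tuple(sorted((r1, r2)))  # order-independent pair key
        flags.append(key in seen)
        seen.add(key)
    return flags

pairs = [
    (("chr1", 100, "+"), ("chr5", 2000, "-")),  # inter-chromosomal pair
    (("chr5", 2000, "-"), ("chr1", 100, "+")),  # same fragment, mates swapped
    (("chr1", 100, "+"), ("chr1", 350, "-")),   # ordinary pair, distinct
]
print(mark_pair_duplicates(pairs))  # [False, True, False]
```

Sorting the two mate coordinates before hashing is what makes the second record collapse onto the first even though the mates are listed in the opposite order.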

Multi-mapping reads can also pose a problem for mapping-based deduplication methods, depending on how the aligner handles ambiguity (e.g. non-deterministic assignment is common), as can split alignments, which some aligners produce.

— Brian Bushnell (20 months ago)

To support your contention: Picard misses PE duplicates whose mates map to different chromosomes.

— Devon Ryan (20 months ago)

Ah, thanks, Picard was indeed what I was thinking of.

— Brian Bushnell (20 months ago)

Have you got a reference for that? I've read that Picard's MarkDuplicates can handle inter-chromosomal pairs while Samtools' rmdup cannot:

  1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965708/
  2. https://sourceforge.net/p/picard/wiki/Main_Page/#q-what-is-the-difference-between-markduplicates-and-samtools-rmdup
— James Ashmore (20 months ago)

Plain observation: I've recently been improving the duplicate marking in deepTools, and this is one of the few sources of difference between its output and Picard's. So even if they document catching them, they don't always.

— Devon Ryan (20 months ago)

Interesting and quite surprising; I'll double-check my data.

— James Ashmore (20 months ago)
Istvan Albert (University Park, USA) wrote (20 months ago):

Basically duplicates are of two kinds:

  • natural duplicates - caused by the biological system producing identical DNA fragments
  • artificial duplicates - caused by the sequencing instrument producing identical DNA fragments

Of course, we'd want to keep the first kind of duplicate and remove the second. But rarely, if ever, is a clear distinction possible between the two situations. Hence the conundrum.

While we are at it, an empirical observation of mine: data with high rates of artificial duplication is often useless even after fixing this problem, because many other problems turn up as well. So it does not really matter what you do with it; it remains useless.

In general, from what I understand, people tend to deduplicate their data when uniform coverage is expected across the genome and when the coverage over a given position has major implications for the results. For example, in SNP calling the number of reads supporting a variant is an essential factor in trusting that variant, and we'd want to avoid counting artificial duplicates there.
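To make the SNP-calling point concrete, here is a toy count (all numbers invented): if one variant-carrying fragment is PCR-amplified several times, the raw allele fraction overstates the independent evidence for the variant.

```python
# Toy illustration: artificial duplicates inflate apparent variant support.
# Numbers are invented; real variant callers weigh far more evidence.

reads_at_site = (
    ["ALT"] * 2      # two independent fragments carrying the variant
    + ["ALT"] * 6    # PCR copies of one of those fragments
    + ["REF"] * 12   # reference-supporting fragments
)
raw_alt_fraction = reads_at_site.count("ALT") / len(reads_at_site)

# After deduplication, only the independent fragments remain:
dedup = ["ALT"] * 2 + ["REF"] * 12
dedup_alt_fraction = dedup.count("ALT") / len(dedup)

print(round(raw_alt_fraction, 2), round(dedup_alt_fraction, 2))  # 0.4 0.14
```

The same site goes from 40% to roughly 14% variant support once the PCR copies are collapsed, which could flip a caller's decision.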

In most other cases, especially when the expected coverage varies wildly and there are legitimate reasons for a fragment to occur very frequently (e.g. a highly expressed short transcript in a transcriptome study), duplicate removal is not recommended.

— Istvan Albert (20 months ago)

Thanks for the breakdown. I am doing ATAC-seq, so I'm trying to understand the overall pros and cons in that application.

— rbronste (20 months ago)
Powered by Biostar version 2.3.0