Question: How do Read Alignment tools detect PCR duplicates?
4 weeks ago by
madischapel (0) wrote:

Hi everyone, I'm working on a term project involving read alignment tools, and I had a question regarding how these programs detect and report PCR duplicates.

As I understand it, a proportion of reads reported as PCR duplicates will be false positives. One read of a pair may have a sequence identical to other reads, but if its mate aligns to a different region of the genome, the pair is not a true PCR duplicate, because it cannot have originated from the same DNA fragment. Programs like FastQC only consider one read at a time, without looking at the paired-end data.

But the SAM output from read alignment tools also contains a flag for PCR duplicates. When flagging a PCR duplicate, do read alignment tools look only at individual reads, or do they take into consideration the position of the other read in the pair when the reads come from a paired-end library?

If anyone could give more insight into this I would appreciate it!

alignment • 187 views
modified 4 weeks ago by Friederike (3.6k) • written 4 weeks ago by madischapel (0)

Take a look at the thread here: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. One advantage is that you don't need to align the data to identify duplicates.

If both reads of a pair have sequences identical to those of another pair from the fragments you are sequencing, then there is a chance they are PCR duplicates. You can't be 100% sure unless you use UMIs in your library prep to label individual RNA molecules.
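
To make the idea concrete, here is a minimal sketch of alignment-free duplicate counting on toy data. It is not how Clumpify works internally; it only does exact matching on both reads of a pair, and the function name, the `use_umi` switch and the example sequences are all made up for illustration.

```python
from collections import Counter

def count_sequence_duplicates(read_pairs, use_umi=False):
    """Count extra copies of read pairs whose sequences match exactly.

    read_pairs: iterable of (umi, read1_seq, read2_seq) tuples (toy layout,
    assumed for this sketch). With use_umi=True the UMI becomes part of the
    key, so identical sequences with different UMIs count as distinct molecules.
    """
    seen = Counter()
    for umi, seq1, seq2 in read_pairs:
        key = (umi, seq1, seq2) if use_umi else (seq1, seq2)
        seen[key] += 1
    # Every copy beyond the first in a group is a candidate duplicate.
    return sum(n - 1 for n in seen.values())

read_pairs = [
    ("AACG", "ACGTACGT", "TTGCAAGC"),
    ("AACG", "ACGTACGT", "TTGCAAGC"),  # same UMI, same sequences
    ("GGTA", "ACGTACGT", "TTGCAAGC"),  # same sequences, different UMI
    ("AACG", "ACGTACGT", "CCGGTTAA"),  # read 1 matches, read 2 does not
]

print(count_sequence_duplicates(read_pairs))                # 2
print(count_sequence_duplicates(read_pairs, use_umi=True))  # 1
```

Without UMIs the third pair looks like a duplicate; with UMIs it is kept as a separate molecule. The fourth pair is never counted, because only one of its two reads matches.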

modified 4 weeks ago • written 4 weeks ago by genomax (65k)

Keep in mind also that the label "PCR duplicate" is a bit misleading. In fact, it refers to positional duplicates, i.e. reads or read pairs with identical alignment coordinates. As far as I know, in typical Illumina sequencing libraries there is no way to tell apart positional duplicates from PCR duplicates.
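
As a rough illustration of what "positional duplicate" means for paired-end data, the sketch below groups toy read pairs by the coordinates and strands of both reads and keeps the highest-quality pair in each group. This is a simplification under stated assumptions, not the actual algorithm of any specific tool; the `Pair` record, its fields and the quality-based tie-break are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Pair(NamedTuple):
    name: str
    chrom: str
    pos1: int     # 5' alignment position of read 1
    strand1: str
    pos2: int     # 5' alignment position of read 2 (the mate)
    strand2: str
    qual: int     # summed base quality, used to pick which pair to keep

def mark_positional_duplicates(pairs):
    """Return the names of pairs that share coordinates with a better pair."""
    groups = defaultdict(list)
    for p in pairs:
        # Both reads contribute to the key, not just read 1.
        groups[(p.chrom, p.pos1, p.strand1, p.pos2, p.strand2)].append(p)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: p.qual)
        duplicates.update(p.name for p in members if p is not best)
    return duplicates

pairs = [
    Pair("a", "chr1", 100, "+", 350, "-", 60),
    Pair("b", "chr1", 100, "+", 350, "-", 72),
    Pair("c", "chr1", 100, "+", 410, "-", 65),  # same read 1, different mate
]

print(sorted(mark_positional_duplicates(pairs)))  # ['a']
```

Pair `c` starts at the same position as `a` and `b` on read 1, but its mate aligns elsewhere, so it is not marked.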

written 4 weeks ago by dariober (10.0k)
4 weeks ago by
United States
Friederike (3.6k) wrote:

To sum up what swbarnes2 and genomax wrote:

  1. Alignment tools usually don't change the FLAG entry that indicates whether a given read may be a duplicate or not.
  2. FastQC never changes anything in the fastq or bam files it is looking at.
  3. Commonly used tools that do detect duplicates include samtools markdup, Picard's MarkDuplicates, and the Clumpify tool mentioned by genomax. Different tools may handle specific details differently, so if you need to be absolutely sure, it pays to read the documentation of the tool you settle on. The general consensus, though, is that for paired-end reads both reads of a pair are taken into consideration.
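
For reference, the duplicate marker is bit 0x400 (decimal 1024) of the FLAG column in SAM, so you can check whether a file has been through a duplicate-marking step with a few lines of code. The sketch below reads plain SAM text; the helper names and the toy records are made up for illustration.

```python
DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate

def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE)

def count_duplicates(sam_lines):
    """Return (duplicates, total) over the alignment records in SAM text."""
    total = dup = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.rstrip("\n").split("\t")[1])  # FLAG is column 2
        total += 1
        dup += is_duplicate(flag)
    return dup, total

sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t8M\t=\t350\t258\tACGTACGT\tIIIIIIII",
    "r2\t1123\tchr1\t100\t60\t8M\t=\t350\t258\tACGTACGT\tIIIIIIII",
]

print(count_duplicates(sam_lines))  # (1, 2)
```

Here 1123 is 99 + 1024, i.e. the same paired, first-in-pair record with the duplicate bit added. Output straight from an aligner will typically show zero duplicates until a marking tool has been run.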
written 4 weeks ago by Friederike (3.6k)
4 weeks ago by
United States
swbarnes2 (5.2k) wrote:

FastQC is not an aligner. It reports any sequences it sees over and over again, as that might indicate a quality issue.

Many aligners don't touch the PCR duplicate flag. Most people use programs like Picard Tools to flag PCR duplicates after alignment. Picard Tools is smart enough to use both reads of a pair if told to do so.

modified 4 weeks ago • written 4 weeks ago by swbarnes2 (5.2k)