Question

Does 5' read trimming affect duplicate detection?

3.1 years ago

mcrepeau ▴ 20

My understanding is that PCR duplicate detection relies on detecting pairs of reads that share identical 5'-end coordinates and orientations. Does quality trimming of the 5' ends of reads prior to alignment interfere with this (post-alignment) duplicate detection? For example, if we have 2 read pairs (call them A and B) resulting from PCR duplication, and read 1 of pair A gets a few bases trimmed off the 5'-end because they are low quality, but in pair B the same bases are higher quality and do not get trimmed, then the 5'-end of pair A read 1 would have a different coordinate than the 5'-end of pair B read 1. Wouldn't this defeat the duplicate detection algorithm? Or is there something I'm missing here?

NGS • 1.1k views

score 1 · Answer 1 · 2021-03-22

3.1 years ago

GenoMax 141k

It depends on how duplicates are defined. clumpify.sh from the BBMap suite can detect duplicates without alignment (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). It includes settings that allow containments, where one sequence is shorter than the other, and these can be used in this case:

containment=f       Allow containments (where one sequence is shorter).
affix=f             For containments, require one sequence to be an affix
                    (prefix or suffix) of the other.
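
To make the containment idea concrete, here is a minimal Python sketch of what an affix check means for a 5'-trimmed read. It is a toy illustration only, not Clumpify's actual implementation; the function name `is_containment_duplicate` and the example sequence are made up for this sketch.

```python
def is_containment_duplicate(a: str, b: str, require_affix: bool = True) -> bool:
    """Toy check: is the shorter sequence contained in the longer one?

    With require_affix=True the shorter sequence must be a prefix or a
    suffix of the longer one (loosely analogous to affix=t); otherwise
    any substring match counts.
    """
    shorter, longer = sorted((a, b), key=len)
    if require_affix:
        return longer.startswith(shorter) or longer.endswith(shorter)
    return shorter in longer


# Hypothetical duplicate pair: the same read, once intact and once with
# three low-quality bases trimmed from the 5' end.
full = "ACGTTGCAAGGCTTACGGAT"
trimmed = full[3:]

print(full == trimmed)                          # exact identity fails
print(is_containment_duplicate(full, trimmed))  # trimmed read is a suffix
```

A 5'-trimmed read is a suffix of its untrimmed duplicate, so exact sequence identity misses the pair while an affix-style containment check still links them.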

Hmm... I was thinking more of post-alignment duplicate marking, as with Picard tools. But doing it on the raw reads is an interesting option. More than one way to skin a cat, I guess.

Reply by mcrepeau ▴ 20 • 3.1 years ago

It would be best to do this on original reads since aligners will soft-clip bases and you may lose important information in reported alignments. You can also find other kinds of duplicates (e.g. optical) with original reads.

Reply by GenoMax 141k • 3.1 years ago

score 0 · Answer 2 · 2021-03-22

It depends on what type of duplicate detection you are using.

Trimming could interfere with the process if the duplicate detection is strictly based on sequence identity.

There are other types of duplicate detectors that look at the coordinates where reads or read pairs align. In those cases, reads or pairs with identical alignment positions and orientations are treated as duplicates.
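
As a rough sketch of the coordinate-based approach, the Python below computes an "unclipped" 5' position from a SAM POS, CIGAR string, and strand. Post-alignment markers such as Picard MarkDuplicates are generally described as comparing unclipped 5' positions, but this is an illustration under that assumption, not any tool's actual code; `unclipped_five_prime` and the example alignments are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos: int, cigar: str, reverse: bool) -> int:
    """Return the 5' reference coordinate with clipped bases added back.

    pos is the 1-based leftmost aligned position (SAM POS field).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


# Three hypothetical forward-strand reads from the same 50 bp fragment end.
untrimmed = unclipped_five_prime(100, "50M", reverse=False)
soft_clipped = unclipped_five_prime(103, "3S47M", reverse=False)
hard_trimmed = unclipped_five_prime(103, "47M", reverse=False)  # trimmed before alignment

print(untrimmed, soft_clipped, hard_trimmed)  # 100 100 103
```

In this sketch, soft-clipped bases can be added back because the CIGAR string records them, so the soft-clipped read still shares a 5' coordinate with its duplicate. Bases removed before alignment leave no trace in the alignment record, so the trimmed read gets a different coordinate, which is the situation described in the question.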