Does 5' read trimming affect duplicate detection?
3.1 years ago
mcrepeau ▴ 20

My understanding is that PCR duplicate detection relies on detecting pairs of reads that share identical 5'-end coordinates and orientations. Does quality trimming of the 5' ends of reads prior to alignment interfere with this (post-alignment) duplicate detection? For example, if we have 2 read pairs (call them A and B) resulting from PCR duplication, and read 1 of pair A gets a few bases trimmed off the 5'-end because they are low quality, but in pair B the same bases are higher quality and do not get trimmed, then the 5'-end of pair A read 1 would have a different coordinate than the 5'-end of pair B read 1. Wouldn't this defeat the duplicate detection algorithm? Or is there something I'm missing here?

NGS • 1.1k views
3.1 years ago
GenoMax 141k

It depends on how the duplicates are being defined. clumpify.sh from the BBMap suite allows duplicate detection without alignments (see: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). clumpify includes settings that allow containment, which can be used in this case:

containment=f       Allow containments (where one sequence is shorter).
affix=f             For containments, require one sequence to be an affix
                    (prefix or suffix) of the other.
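
To illustrate what a containment match with the affix requirement means, here is a minimal Python sketch. It is not clumpify's implementation; the function names `is_affix` and `is_contained_duplicate` and the example sequence are made up for illustration, and the sketch ignores mismatches, quality scores, and paired reads.

```python
def is_affix(shorter: str, longer: str) -> bool:
    """True if `shorter` is a prefix or a suffix of `longer`."""
    return longer.startswith(shorter) or longer.endswith(shorter)


def is_contained_duplicate(a: str, b: str, require_affix: bool = True) -> bool:
    """Toy containment test: the shorter read must occur inside the longer one.

    With require_affix=True the shorter read must sit at one end of the longer
    read, which is the situation produced by end trimming.
    """
    shorter, longer = sorted((a, b), key=len)
    if require_affix:
        return is_affix(shorter, longer)
    return shorter in longer


full = "ACGTTGCAAGGCTA"   # hypothetical untrimmed read
trimmed = full[3:]        # same read with three bases removed from the 5' end

print(full == trimmed)                        # False: exact matching misses the pair
print(is_contained_duplicate(full, trimmed))  # True: the trimmed read is a suffix
```

Under this toy model, a read that lost a few 5' bases to quality trimming is still recognized as a copy of its untrimmed duplicate, because it is a suffix of the longer sequence.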

Hmm... I was thinking more of post-alignment duplicate marking, like with Picard. But doing it on the raw reads is an interesting option. More than one way to skin a cat I guess.


It would be best to do this on original reads since aligners will soft-clip bases and you may lose important information in reported alignments. You can also find other kinds of duplicates (e.g. optical) with original reads.
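
For context on the soft-clipping point, the sketch below shows how a 5' coordinate can be computed from a SAM record's POS and CIGAR fields with clipped bases added back. As far as I know, position-based duplicate markers generally compare this kind of "unclipped" coordinate, but treat that as an assumption and check the documentation of the tool you use. The function name `unclipped_five_prime` and the example coordinates are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos: int, cigar: str, reverse: bool) -> int:
    """Return the 1-based reference coordinate of the read's 5' end,
    extended over any soft- or hard-clipped bases at that end."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is the leftmost base, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


print(unclipped_five_prime(100, "50M", reverse=False))    # 100: full-length alignment
print(unclipped_five_prime(103, "3S47M", reverse=False))  # 100: soft clip is added back
print(unclipped_five_prime(103, "47M", reverse=False))    # 103: bases trimmed before alignment are gone
```

The third call is the case from the original question: bases removed before alignment leave no trace in the CIGAR string, so the coordinate cannot be corrected the way a soft clip can.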

3.1 years ago

It depends on what type of duplicate detection you are using.

Trimming could interfere with the process if the duplicate detection is strictly based on sequence identity.

Other duplicate detectors look at the coordinates where reads or read pairs align. In those cases reads or pairs with identical alignment coordinates are treated as duplicates.
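
A toy Python sketch of coordinate-based grouping follows. It assumes each read is reduced to a (chromosome, strand, 5' coordinate) key, which is a simplification: real tools also consider the mate's coordinates, library, and other fields. The function name `group_by_five_prime`, the read names, and the positions are all invented for the example.

```python
from collections import defaultdict


def group_by_five_prime(reads):
    """Group read names by (chrom, strand, 5' coordinate)."""
    groups = defaultdict(list)
    for name, chrom, strand, five_prime in reads:
        groups[(chrom, strand, five_prime)].append(name)
    return dict(groups)


reads = [
    ("A/1", "chr1", "+", 1003),  # 3 bases trimmed from the 5' end before alignment
    ("B/1", "chr1", "+", 1000),  # same fragment, untrimmed
    ("C/1", "chr1", "+", 1000),  # another untrimmed copy
]

groups = group_by_five_prime(reads)
duplicate_sets = [names for names in groups.values() if len(names) > 1]
print(duplicate_sets)  # [['B/1', 'C/1']]
```

In this simplified model the trimmed read A/1 ends up with its own key and is not grouped with B/1 and C/1, which is the scenario the original question describes.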
