Question: How would you filter PCR duplicates for merged paired-end reads?
hohoku wrote (12 months ago):

I have some paired-end sequencing data that has a significant number of pairs that overlap due to small insert sizes. In my experience, merging the read pairs (and recalibrating with bbmap) results in better alignments. However, when it comes to the PCR duplicate removal step of the merged read pairs, I want to identify those alignments where both the 5' and 3' ends are identical, and none of the commonly used tools (samtools, sambamba, picard) appears to have this feature.

My question is, how would you filter for these overlapping read pairs that have been merged prior to alignment? I don't want to treat them as single-end reads for PCR duplicate removal as I may end up discarding information unnecessarily.

hohoku wrote (12 months ago):

I found that the software Paleomix has the exact tool I was looking for.

paleomix rmdup_collapsed --remove-duplicates < sorted.bam > out.bam

finswimmer (Germany) wrote (12 months ago):

Hello,

There are several approaches to identifying PCR duplicates in paired-end sequencing:

  • compare the 5' mapping positions of the read pairs
  • compare the most 5' mapping positions of the read pairs, taking clipped bases into account
  • compare the sequences of the read pairs

In my experience the results are more or less the same.
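
The second approach in the list above (5' positions with clipped bases taken into account) can be sketched roughly as follows. This is only an illustration, not the code of any particular tool: the function name `five_prime_position` is made up here, and it assumes the SAM `POS` and `CIGAR` fields plus the strand are already available.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def five_prime_position(pos, cigar, is_reverse):
    """Return the 1-based unclipped 5' reference coordinate (hypothetical helper).

    pos        -- SAM POS field (1-based leftmost aligned base)
    cigar      -- SAM CIGAR string, e.g. "5S45M"
    is_reverse -- True if the read aligns to the reverse strand
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not is_reverse:
        # Forward strand: the 5' end is the leftmost base, so extend
        # the start to the left by any leading soft/hard clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so compute the
    # aligned end and extend it to the right by any trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail
```

With this, a forward read at `POS` 105 with CIGAR `5S45M` and one at `POS` 100 with `50M` both get the 5' coordinate 100, so clipping no longer hides the duplicate.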

When working with merged overlapping reads, one has the problem that the alignment contains a mixture of paired and single reads. This is why I prefer to remove duplicates based on their sequence prior to merging the reads. A tool that can do this is clumpify.sh from bbtools:

$ clumpify.sh in=in_R1.fastq.gz in2=in_R2.fastq.gz out=out_R1.fastq.gz out2=out_R2.fastq.gz dedupe
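
Conceptually, sequence-based deduplication keeps one read pair per distinct combination of R1 and R2 sequence. The sketch below only illustrates that idea with exact matches; it is not how clumpify.sh is implemented, and the function name and the tuple layout of the input are assumptions made for the example.

```python
def dedupe_pairs_by_sequence(pairs):
    """Keep one pair per distinct (R1, R2) sequence (hypothetical helper).

    pairs -- iterable of (name, seq1, qual1, seq2, qual2) tuples, with
             qualities as Phred+33 strings. Among exact duplicates, the
             pair with the highest summed base quality is kept.
    """
    best = {}  # (seq1, seq2) -> (score, record)
    for record in pairs:
        name, seq1, qual1, seq2, qual2 = record
        key = (seq1, seq2)
        score = sum(ord(c) - 33 for c in qual1 + qual2)
        if key not in best or score > best[key][0]:
            best[key] = (score, record)
    # Dicts preserve insertion order, so output follows first occurrence.
    return [record for _, record in best.values()]
```

A real tool working on large FASTQ files would need to stream or sort the data rather than hold every pair in memory, which this sketch does not attempt.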

fin swimmer


In my experience the results are more or less the same

Definitely agreed. I therefore strongly recommend choosing whichever approach or tool you feel most comfortable with and then proceeding with the analysis, rather than spending more time on the duplicate issue.

Reply by ATpoint, 12 months ago

I see your point, but I'm still interested in how to solve this problem. It only requires identifying alignments with identical 5' and 3' ends, so I thought someone here might know a neat way to do it.

Reply by hohoku, 12 months ago
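
For illustration, identifying alignments with identical 5' and 3' ends could be sketched as below. This is a minimal sketch under stated assumptions, not a replacement for a tested tool: it assumes every mapped record is a merged read representing a whole fragment, works on plain SAM text lines, ignores secondary and supplementary alignments, and keeps the record with the highest MAPQ per group. The function names `fragment_key` and `dedupe_by_both_ends` are invented for the example.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def fragment_key(fields):
    """Return (reference, unclipped start, unclipped end, is_reverse)."""
    flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Reference bases covered by the alignment itself.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    lead = 0
    for n, op in ops:              # leading soft/hard clips
        if op not in "SH":
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):    # trailing soft/hard clips
        if op not in "SH":
            break
        trail += n
    return (rname, pos - lead, pos + ref_len - 1 + trail, bool(flag & 16))

def dedupe_by_both_ends(sam_lines):
    """Keep one record per (reference, start, end, strand), in input order."""
    lines = [line.rstrip("\n") for line in sam_lines]
    best = {}     # key -> (mapq, line index)
    keep = set()  # indices of lines to emit
    for i, line in enumerate(lines):
        if line.startswith("@"):
            keep.add(i)            # header lines pass through
            continue
        fields = line.split("\t")
        if int(fields[1]) & 4:
            keep.add(i)            # unmapped reads pass through
            continue
        key = fragment_key(fields)
        mapq = int(fields[4])
        if key not in best or mapq > best[key][0]:
            best[key] = (mapq, i)
    keep.update(i for _, i in best.values())
    return [lines[i] for i in sorted(keep)]
```

Two forward-strand reads starting at the same position but with different end coordinates are both kept here, whereas a comparison based on the 5' end alone would discard one of them.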

Yes, I am aware of clumpify and do like it. But when we sequence more of our sample at a later date, often substantially more, it is more practical to just merge the BAM files from all sequencing lanes than to go back to the FASTQs, merge them all, deduplicate and map again.

Reply by hohoku, 12 months ago

but sometimes when we sequence more of our sample at a later date

From the same library as in the first sequencing run? Otherwise removing duplicates after merging wouldn't be correct.

fin swimmer

Reply by finswimmer, 12 months ago

Yes, same library... I'm not that bad

Reply by hohoku, 12 months ago