Mark duplicates for single end reads: Why only 5'end?
2
1
Entering edit mode
9.0 years ago

Hi All,

Again on the issue of marking duplicates...

If you have single end reads tools like Picard MarkDuplicates mark as duplicates reads sharing the same 5'end coordinates and ignore the 3' coordinate (see here). Is this the right thing to do?

I can see the point in looking at only 5' coordinates when the read length was quite a bit shorter then the fragment length (say 35-70 bp) and sequence quality dropped after ~50bp.

But now you can get reads up to 300bp, which is well into or above the range of the fragment length. So reads sharing the same 5'end but different 3'end might well be genuine independent fragments and MarkDuplicates would get them wrong. Am I missing something? Does anybody know of a tool that looks at both 5' and 3' ends for SE reads?

Read duplicates MarkDuplicates 5' 3'end • 6.5k views
ADD COMMENT
0
Entering edit mode

I think the 3' end isn't taken into consideration in the Picard algorithm because read quality decays variably at the 3' end of a read. Aligners such as BWA usually soft clip these low-quality bases. Thus two reads which truly were PCR duplicates might not be marked as such if one had a different soft-clipping pattern. If you took into account the 3' end, you would introduce a higher proportion of false negatives as a result.

ADD REPLY
0
Entering edit mode

Hi- I see your point, but that's what I'm questioning. Nowadays read quality stays high for up to 300bp (MiSeq, I've seen it!) so I think ignoring 3'ends has become overly conservative, especially if your fragment size is quite small ~100-200bp.

ADD REPLY
0
Entering edit mode
What kind of information do you expect in the 300 bases of read along a 200 base fragment? The last 100 are nonsense?
ADD REPLY
0
Entering edit mode

Last 100 bases would be adapter sequence which would be trimmed before further processing (i.e. before alignment).

ADD REPLY
1
Entering edit mode
9.0 years ago

An update on this... I ended up writing a program to mark duplicates sharing start and end coordinates on the same strand and read group. The program is here MarkDupsByStartEnd.jar if anyone is interested.

@Kamil, I'm afraid samtools rmdup also look for 5'end only.

ADD COMMENT
0
Entering edit mode
9.0 years ago
Kamil ★ 2.3k

I think your reasoning is correct regarding 300 bp reads.

It seems to me that samtools rmdup does use both 5' and 3' ends.

Other related discussions:

ADD COMMENT

Login before adding your answer.

Traffic: 2769 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6