Question: Mark duplicates for single end reads: Why only 5'end?
1
gravatar for dariober
2.3 years ago by
dariober8.0k
Glasgow - UK
dariober8.0k wrote:

Hi All,

Again on the issue of marking duplicates...

If you have single end reads tools like Picard MarkDuplicates mark as duplicates reads sharing the same 5'end coordinates and ignore the 3' coordinate (see here). Is this the right thing to do?

I can see the point in looking at only 5' coordinates when the read length was quite a bit shorter then the fragment length (say 35-70 bp) and sequence quality dropped after ~50bp.

But now you can get reads up to 300bp, which is well into or above the range of the fragment length. So reads sharing the same 5'end but different 3'end might well be genuine independent fragments and MarkDuplicates would get them wrong. Am I missing something? Does anybody know of a tool that looks at both 5' and 3' ends for SE reads?

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by dariober8.0k

I think the 3' end isn't taken into consideration in the Picard algorithm because read quality decays variably at the 3' end of a read. Aligners such as BWA usually soft clip these low-quality bases. Thus two reads which truly were PCR duplicates might not be marked as such if one had a different soft-clipping pattern. If you took into account the 3' end, you would introduce a higher proportion of false negatives as a result.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Dan D6.1k

Hi- I see your point, but that's what I'm questioning. Nowadays read quality stays high for up to 300bp (MiSeq, I've seen it!) so I think ignoring 3'ends has become overly conservative, especially if your fragment size is quite small ~100-200bp.

ADD REPLYlink written 2.3 years ago by dariober8.0k
What kind of information do you expect in the 300 bases of read along a 200 base fragment? The last 100 are nonsense?
ADD REPLYlink written 2.3 years ago by karl.stamm3.2k

Last 100 bases would be adapter sequence which would be trimmed before further processing (i.e. before alignment).

ADD REPLYlink written 2.3 years ago by dariober8.0k
1
gravatar for dariober
2.3 years ago by
dariober8.0k
Glasgow - UK
dariober8.0k wrote:

An update on this... I ended up writing a program to mark duplicates sharing start and end coordinates on the same strand and read group. The program is here MarkDupsByStartEnd.jar if anyone is interested.

@Kamil, I'm afraid samtools rmdup also look for 5'end only.

ADD COMMENTlink written 2.3 years ago by dariober8.0k
0
gravatar for Kamil
2.3 years ago by
Kamil1.5k
Boston
Kamil1.5k wrote:

I think your reasoning is correct regarding 300 bp reads.

It seems to me that samtools rmdup does use both 5' and 3' ends.

Other related discussions:

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Kamil1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 584 users visited in the last hour