Question: Picard tools duplicate removal
5 months ago, blur (80), European Union, wrote:

Hi, I want to use the Picard tools MarkDuplicates tool, but after reading the manual I am still not sure I understand the method it uses. The manual reads: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file"

Does this mean duplicates are marked based on their chromosome + start position and the 5' sequence? Or does the tool take the full sequence into account by using the CIGAR data?

Thanks in advance.

rna-seq picard-tools • 1.1k views
written 5 months ago by blur (80)
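
For readers with the same question: MarkDuplicates is generally described as keying on alignment coordinates rather than on the read sequence itself. Reads (or pairs) are grouped by reference, strand, and the 5' position after soft/hard clipping has been "undone" using the CIGAR string. The Python sketch below illustrates that grouping idea for single-end reads only. It is not Picard's code: the function names, the tuple layout, and the example reads are hypothetical, and pairs, library information, and the choice of which read to keep are ignored.

```python
import re
from collections import defaultdict

# SAM CIGAR operations: a length followed by one operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based unclipped 5' coordinate of an aligned read.

    pos     -- 1-based leftmost aligned position (SAM POS field)
    cigar   -- CIGAR string, e.g. "3S47M"
    reverse -- True if the read maps to the reverse strand
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is on the right, so take the alignment end
    # on the reference and add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


def group_candidates(reads):
    """Group reads sharing (chromosome, unclipped 5' position, strand)."""
    groups = defaultdict(list)
    for name, chrom, pos, cigar, reverse in reads:
        key = (chrom, unclipped_five_prime(pos, cigar, reverse), reverse)
        groups[key].append(name)
    return dict(groups)


# Hypothetical reads: (name, chromosome, POS, CIGAR, is_reverse)
reads = [
    ("r1", "chr1", 100, "50M", False),
    ("r2", "chr1", 103, "3S47M", False),  # soft-clipped, same unclipped start as r1
    ("r3", "chr1", 100, "50M", True),     # same POS but opposite strand
    ("r4", "chr1", 101, "50M", False),
]
groups = group_candidates(reads)
```

In this sketch r1 and r2 land in the same group even though their POS fields differ, because the CIGAR is used only to recover the unclipped 5' coordinate, not to compare bases.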

Will the answer to this question influence your decision to use it or not in any way?

written 5 months ago by YaGalbi (1.1k)

Yes. Duplicate removal has influenced my results dramatically in the past.

written 5 months ago by blur (80)

I hope you do not want to remove duplicates from RNA-seq data, as the tags on your post suggest?

written 5 months ago by ATpoint (3.2k)

That is exactly why this operation is so dangerous. You had better be sure that the removed duplicates are all artificial and not a natural effect of high coverage.

There is a common myth floating around that "duplicates" is a synonym for "error". That is a remnant of the past, when coverage was typically low.

modified 5 months ago • written 5 months ago by Istvan Albert ♦♦ (75k)
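
To put a rough number on the "natural effect of high coverage" point, here is a back-of-the-envelope Python sketch. It assumes a deliberately simplified model that is not from the thread: single-end reads whose start positions fall uniformly at random on one strand of a transcript, with no PCR or optical duplicates at all. The function name and the example figures are hypothetical.

```python
def expected_natural_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that share a start position with an
    earlier read, under uniform random placement and no PCR duplicates.

    Expected number of occupied start positions is
        L * (1 - (1 - 1/L) ** N)
    so every read beyond the first at each position looks like a duplicate.
    """
    occupied = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - occupied / n_reads


# Hypothetical example: a highly expressed 1 kb transcript with 10,000 reads.
fraction = expected_natural_duplicate_fraction(n_reads=10_000, n_positions=1_000)
print(f"expected 'duplicate' fraction with no artifacts: {fraction:.2f}")
```

Under this toy model, once the number of reads far exceeds the number of possible start positions, most reads necessarily share a start position with another read even though none of them is an artifact.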

Keeping in mind @ATpoint's note, if you do want to remove PCR/optical duplicates for other reasons, then use Clumpify (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). It does not need the data to be aligned and works directly from the sequences.

modified 5 months ago • written 5 months ago by genomax (44k)
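
As a contrast with coordinate-based marking, the general idea of alignment-free, sequence-based duplicate removal can be sketched in a few lines of Python. This is only an illustration of exact-match deduplication on in-memory records. It is not Clumpify's algorithm, and the function name and record layout are hypothetical.

```python
def dedupe_by_sequence(records):
    """Keep the first record for each distinct read sequence.

    records -- iterable of (name, sequence, quality) tuples,
               e.g. parsed from a FASTQ file
    """
    seen = set()
    kept = []
    for name, seq, qual in records:
        if seq in seen:
            continue  # identical sequence already kept: treat as a duplicate
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


# Hypothetical reads: r1 and r3 have identical sequences.
records = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGA", "IIIIIIII"),
    ("r3", "ACGTACGT", "HHHHHHHH"),
]
unique = dedupe_by_sequence(records)
```

Exact matching is the simplest possible criterion; a single sequencing error makes two copies of the same fragment look distinct, which is one reason dedicated tools are preferable for real data.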