Question: Picard tools duplicate removal
5 months ago
European Union
blur80 wrote:

Hi, I want to use the Picard tools MarkDuplicates option, but after reading the manual I am still not sure I understand the method used. It reads: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file".

Does this mean duplicates are marked based on their chr+start position and the 5' sequence? Or does the tool take the full sequence into account by using the CIGAR data?
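To make the position-based alternative concrete, here is a minimal Python sketch of a duplicate key built from chromosome, strand, and the 5' coordinate with clipped bases added back from the CIGAR string. It assumes that this is roughly what "5 prime positions" refers to; the function names are hypothetical and this is not Picard's actual code.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based 5' coordinate with clipped bases added back.

    pos is the SAM POS field (1-based leftmost aligned base).
    This is an illustrative helper, not a Picard or htsjdk function.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is on the right, so walk to the
    # alignment end on the reference and add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


def duplicate_key(chrom, pos, cigar, reverse):
    """Assumed key: two reads with the same key are duplicate candidates."""
    return (chrom, unclipped_five_prime(pos, cigar, reverse), reverse)
```

With such a key, a forward read at position 100 with CIGAR `5S45M` and one at position 95 with CIGAR `50M` would be treated as duplicate candidates even though their reported start positions differ; the read sequence itself is not part of the key.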

Thanks in advance.

rna-seq picard-tools • 1.1k views
5 months ago by blur80

Will the answer to this question influence your decision to use it or not in any way?

5 months ago by YaGalbi

Yes. Duplicate removal has influenced my results dramatically in the past.

5 months ago by blur80

I hope you do not want to remove duplicates from RNA-seq data, as the tags of your post suggest?

5 months ago by ATpoint

That is exactly why this operation is so dangerous. You had better be sure that the removed duplicates are all artificial and not a natural effect of high coverage.

There is a common myth floating around that "duplicate" is a synonym for "error". That is a remnant of the past, when coverage was typically low.

5 months ago by Istvan Albert

Keeping in mind @ATpoint's note, if you do want to remove PCR/optical duplicates for other reasons, then use Clumpify (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files"). It does not need the data to be aligned and works from the sequences.
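To illustrate what "works from sequences" means in general, here is a toy Python sketch that removes exact-duplicate reads without any alignment. It only shows the idea of alignment-free deduplication; it is not how Clumpify is implemented, and the function name and record layout are made up for the example.

```python
def dedupe_by_sequence(records):
    """Keep the first record seen for each distinct read sequence.

    records is an iterable of (name, sequence, quality) tuples,
    a hypothetical in-memory stand-in for FASTQ entries.
    """
    seen = set()
    kept = []
    for name, seq, qual in records:
        if seq in seen:
            # Same sequence already kept once: treat as a duplicate.
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGT", "IIIIIIII"),
    ("r3", "TTTTCCCC", "IIIIIIII"),
]
unique_reads = dedupe_by_sequence(reads)
```

Note that an exact-match approach like this will also collapse genuinely independent fragments that happen to share a sequence, which is the concern raised above for highly expressed genes in RNA-seq data.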

5 months ago by genomax
