Question

Picard tools duplicate removal

1


6.5 years ago

blur ▴ 280

Hi, I want to use Picard's MarkDuplicates tool, but after reading the manual I am still not sure I understand the method it uses: http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates The manual reads: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file"

Does this mean duplicates are marked based on their chr+start position and the 5' sequence? Or does the tool take the full sequence into account by using the CIGAR data?

Thanks in advance.

RNA-Seq picard-tools • 11k views

ADD COMMENT • link 6.5 years ago by blur ▴ 280

1


Keeping in mind @ATpoint's note, if you do want to remove PCR/optical duplicates for other reasons, use Clumpify (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files"). It works directly from the sequences, so the data do not need to be aligned.

ADD REPLY • link 6.5 years ago by GenoMax 141k

0


Will the answer to this question influence your decision to use it or not in any way?

ADD REPLY • link 6.5 years ago by BioinfGuru ★ 1.7k

0


Yes. Duplicate removal has influenced my results dramatically in the past.

ADD REPLY • link 6.5 years ago by blur ▴ 280

2


Hope you do not want to remove duplicates from RNA-seq data, as the tags of your post suggest?

ADD REPLY • link 6.5 years ago by ATpoint 82k

1


That is exactly why this operation is so dangerous. You had better be sure that the removed duplicates are all artificial and not a natural effect of high coverage.

There is a common myth floating around that "duplicate" is a synonym for "error". That is a remnant of the past, when coverage was typically low.

ADD REPLY • link 6.5 years ago by Istvan Albert 100k

0


I don't doubt you, but do you have a source for this? I am new to RNA-seq and what I have read is in line with the "myth" you're referring to. I would like to know more about whether or not I should be removing duplicates.

ADD REPLY • link 3.2 years ago by emalekos ▴ 20

1


You can search for papers (mostly newer ones) that used Unique Molecular Identifiers (UMIs) to investigate how many of the observed duplicates are actually due to PCR redundancy and how many are due to coverage. The current consensus, from what I know, is that in targeted assays one generally does not remove duplicates, as doing so would remove too many non-technical duplicates. In fact, I know of no pipeline that removes RNA-seq duplicates; here it is generally well accepted to go with the reads/counts as observed rather than deduplicating the experiment.
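To make the UMI idea concrete, here is a minimal Python sketch with made-up reads. It is only an illustration of the reasoning, not how any particular tool works: the function name `count_duplicates`, the tuple layout, and the toy data are all assumptions. It counts how many reads would be called duplicates by position alone, and how many of those also share a UMI and are therefore more likely to be true PCR copies.

```python
from collections import defaultdict


def count_duplicates(reads):
    """Compare position-only duplicate calls with UMI-aware calls.

    `reads` is an iterable of (chrom, five_prime_pos, strand, umi) tuples.
    This layout is hypothetical and chosen only for the illustration.
    """
    # Group the UMIs of all reads that share chromosome, 5' position and strand.
    umis_by_position = defaultdict(list)
    for chrom, pos, strand, umi in reads:
        umis_by_position[(chrom, pos, strand)].append(umi)

    total = 0
    position_duplicates = 0
    umi_duplicates = 0
    for umis in umis_by_position.values():
        total += len(umis)
        # Position only: every read beyond the first at a position is a duplicate.
        position_duplicates += len(umis) - 1
        # UMI-aware: only reads repeating an already seen UMI count as duplicates.
        umi_duplicates += len(umis) - len(set(umis))

    return {
        "reads": total,
        "position_duplicates": position_duplicates,
        "umi_duplicates": umi_duplicates,
    }


# Toy data: four reads start at the same place, but only two share a UMI.
reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "TTGC"),
    ("chr1", 100, "+", "GGAT"),
    ("chr1", 250, "-", "CCTA"),
]

summary = count_duplicates(reads)
print(summary)
```

In this toy case, position-only marking would flag three of the five reads, while the UMIs suggest only one of them is a PCR copy. Real UMI tools also have to deal with sequencing errors in the UMI itself, which this sketch ignores.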

ADD REPLY • link 3.2 years ago by ATpoint 82k

1


See https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/ for references

ADD REPLY • link 3.2 years ago by GenoMax 141k