Question

Duplicate Removal: Filtering on XT:A:U, samtools rmdup, and Picard MarkDuplicates

3

10.6 years ago

Tonyzeng ▴ 310

Hi, I have a question: as a beginner, I am a little confused about duplicate filtering with different tools.

Most of the reads in my SAM file have an optional field (TAG:TYPE:VALUE) in the 12th column. My understanding is that XT:A:U means that the read mapped uniquely to the reference. So if I keep only the reads with XT:A:U, have I already removed all potential duplicates, so that I no longer need samtools rmdup?
Picard MarkDuplicates only marks (flags) the reads that are possible duplicates. Why only mark them instead of deleting all the duplicates?
Given that both Picard and samtools can help remove duplicates, which one is more reliable?

Thank you

samtools picard markduplicates • 6.0k views

updated 10.5 years ago by GouthamAtla 12k • written 10.6 years ago by Tonyzeng ▴ 310

0

I forgot to ask one more question: when I filter on XT:A:U, there is a risk that one read of a paired-end pair will be removed while its mate is kept in the data, right?

10.6 years ago by Tonyzeng ▴ 310

score 3 · Answer 1 · 2013-10-04

1) A unique mapping means that a given read could not be aligned to another position in the genome with the same alignment score. In other words, no other alignment gets a score comparable to the best one. It does not mean that no other read covers the same region of the genome. You will still need to mark duplicates even if you only consider reads with the XT:A:U tag, i.e. only uniquely aligned reads. If two reads are both uniquely aligned but have the same start position on the genome, one of them will have to be filtered out.

2) Some people prefer to keep the duplicates in their BAM file. SNP callers generally ignore reads that are marked as duplicates when calling variants, so keeping them in your BAM file won't hurt. You can also remove them if you want to reduce the file size, but I keep them.

3) I would go with Picard. It is fast. See this post: samtools rmdup and picard mark duplicates
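To make point 1 concrete, here is a minimal Python sketch. The SAM records are made-up toy data, and the grouping key (reference name, start position, strand) is a deliberate simplification: real duplicate markers such as Picard also look at clipping and mate coordinates. It only illustrates that reads carrying XT:A:U can still be duplicates of each other.

```python
from collections import defaultdict

# Toy single-end SAM records (hypothetical data); every read carries XT:A:U.
SAM_LINES = [
    "r1\t0\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tXT:A:U",
    "r2\t0\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tXT:A:U",
    "r3\t0\tchr1\t250\t37\t10M\t*\t0\t0\tTTGACCGTAA\tIIIIIIIIII\tXT:A:U",
]


def is_unique(fields):
    """True if the optional fields (column 12 onward) contain XT:A:U."""
    return "XT:A:U" in fields[11:]


def duplicate_groups(lines):
    """Group uniquely mapped reads by (reference, start, strand).

    Simplified assumption: reads sharing this key are duplicate candidates.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not is_unique(fields):
            continue
        strand = "-" if int(fields[1]) & 0x10 else "+"
        groups[(fields[2], int(fields[3]), strand)].append(fields[0])
    # Keep only positions shared by more than one read.
    return {key: names for key, names in groups.items() if len(names) > 1}


print(duplicate_groups(SAM_LINES))
```

Here r1 and r2 are both uniquely mapped yet start at the same position on the same strand, so a duplicate-marking step would still be needed after the XT:A:U filter.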

score 1 · Answer 2 · 2013-10-26

Picard has an option to remove duplicate reads instead of just marking them. Regarding unmapped reads, it's good to keep them, because tools may come along that use unmapped reads to infer interesting facts from your data. As for Picard vs. samtools rmdup, both are quite good at removing/marking duplicate reads (in PE data), but Picard can remove interchromosomal duplicate reads, which samtools rmdup cannot.
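As a rough illustration of the difference between marking and removing: a marked duplicate stays in the file with the 0x400 bit set in its FLAG column, and removal just means dropping those records (the Picard option referred to above is, as far as I know, `REMOVE_DUPLICATES=true`). The sketch below uses made-up records and hypothetical function names; it is not how Picard itself is implemented.

```python
DUP_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"

# Toy paired-end records (hypothetical data). The second pair has the
# duplicate bit set: 1123 = 99 + 1024 and 1171 = 147 + 1024.
MARKED_LINES = [
    "@SQ\tSN:chr1\tLN:1000",
    "p1\t99\tchr1\t100\t37\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII",
    "p1\t147\tchr1\t300\t37\t10M\t=\t100\t-210\tTTGACCGTAA\tIIIIIIIIII",
    "p2\t1123\tchr1\t100\t37\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII",
    "p2\t1171\tchr1\t300\t37\t10M\t=\t100\t-210\tTTGACCGTAA\tIIIIIIIIII",
]


def is_marked_duplicate(sam_line):
    """True if the record's FLAG has the duplicate bit (0x400) set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUP_FLAG)


def drop_marked_duplicates(lines):
    """Keep header lines and every record that is not flagged as a duplicate."""
    return [
        line for line in lines
        if line.startswith("@") or not is_marked_duplicate(line)
    ]


kept = drop_marked_duplicates(MARKED_LINES)
print(len(MARKED_LINES), "lines in,", len(kept), "lines out")
```

Because marking only sets a flag bit, the decision to drop the reads can be deferred to any later step, which is one reason some people keep marked duplicates in the BAM file.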

score 0 · Answer 3 · 2013-10-04

10.6 years ago

Tonyzeng ▴ 310

I gave up on the XT:A:U filtering step because it produced unpaired reads (pairs left with only one mate).

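The orphaned-mate problem described in this thread can be reproduced on toy data. The sketch below is only an illustration with made-up records and hypothetical function names; it assumes one primary record per mate (no secondary or supplementary alignments). It shows the naive per-read filter leaving an orphan, and a pair-aware variant that keeps a pair only when both mates carry XT:A:U.

```python
from collections import Counter

# Toy paired-end SAM records (hypothetical data). In pairB only one mate
# is uniquely mapped; the other carries XT:A:R (repeat).
PAIRED_LINES = [
    "pairA\t99\tchr1\t100\t37\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\tXT:A:U",
    "pairA\t147\tchr1\t300\t37\t10M\t=\t100\t-210\tTTGACCGTAA\tIIIIIIIIII\tXT:A:U",
    "pairB\t99\tchr2\t500\t37\t10M\t=\t700\t210\tGGGTTTAAAC\tIIIIIIIIII\tXT:A:U",
    "pairB\t147\tchr2\t700\t0\t10M\t=\t500\t-210\tCCCAAATTTG\tIIIIIIIIII\tXT:A:R",
]


def read_name(line):
    return line.split("\t")[0]


def has_xt_u(line):
    """True if the optional fields (column 12 onward) contain XT:A:U."""
    return "XT:A:U" in line.rstrip("\n").split("\t")[11:]


def naive_filter(lines):
    """Per-read filter: keeps any read tagged XT:A:U, ignoring its mate."""
    return [line for line in lines if has_xt_u(line)]


def orphans(lines):
    """Read names that appear only once, i.e. pairs missing a mate."""
    counts = Counter(read_name(line) for line in lines)
    return sorted(name for name, n in counts.items() if n == 1)


def pair_aware_filter(lines):
    """Keep a pair only if both of its mates are tagged XT:A:U."""
    unique_mates = Counter(read_name(line) for line in lines if has_xt_u(line))
    return [line for line in lines if unique_mates[read_name(line)] == 2]


print(orphans(naive_filter(PAIRED_LINES)))
print(orphans(pair_aware_filter(PAIRED_LINES)))
```

Whether dropping the whole pair is the right policy depends on the downstream analysis; the sketch only shows why a per-read tag filter breaks pairing.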