Removing Duplicates: Filtering on XT:A:U, samtools rmdup, and Picard MarkDuplicates
10.5 years ago
Tonyzeng ▴ 310

Hi, as a beginner I am a little confused about duplicate filtering with different tools, so I have a few questions.

  1. Most of the reads in my SAM file carry optional fields (TAG:TYPE:VALUE) from the 12th column onward. My understanding is that XT:A:U means the read mapped uniquely to the reference. If I keep only the reads with XT:A:U, have I already removed all potential duplicates, so that I no longer need samtools rmdup?

  2. Picard MarkDuplicates only marks (flags) the reads that are likely duplicates. Why just mark them instead of deleting all duplicates?

  3. Given that both Picard and samtools can remove duplicates, which one is more reliable?

Thank you

samtools picard markduplicates • 5.9k views

I forgot to ask one more question: when I filter on XT:A:U, there is a risk that one read of a paired-end pair is removed while its mate stays in the data, right?

10.5 years ago

1) A unique mapping means that the read could not be aligned to any other position in the genome with the same alignment score; in other words, no other alignment scores comparably to the best one. It does not mean that no other read covers the same region of the genome. You will still need to mark duplicates even if you keep only reads with the XT:A:U tag, i.e. only uniquely aligned reads. For example, if two reads are both uniquely aligned but have the same start position on the genome, one of them will have to be filtered out.
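To make the distinction concrete, here is a minimal sketch in plain Python. The SAM records are made up, and the duplicate key (chromosome, leftmost position, strand) is a deliberate simplification of what real tools do. It only illustrates that reads can all carry XT:A:U and still share a start position.

```python
from collections import defaultdict

# Toy SAM records (tab-separated). Read names, positions and sequences are
# invented for illustration only.
SAM_LINES = [
    "readA\t0\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tXT:A:U\tNM:i:0",
    "readB\t0\tchr1\t100\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tXT:A:U\tNM:i:0",
    "readC\t0\tchr1\t250\t0\t10M\t*\t0\t0\tTTTTTTTTTT\tIIIIIIIIII\tXT:A:R\tNM:i:0",
    "readD\t16\tchr1\t400\t37\t10M\t*\t0\t0\tGGGGGGGGGG\tIIIIIIIIII\tXT:A:U\tNM:i:0",
]


def is_unique(fields):
    """True if the optional fields (column 12 onward) contain XT:A:U."""
    return "XT:A:U" in fields[11:]


def duplicate_groups(lines):
    """Group uniquely mapped reads by a simplified duplicate key.

    Assumption: the key is (chromosome, leftmost position, strand). Real
    duplicate markers use more information than this.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not is_unique(fields):
            continue
        flag = int(fields[1])
        strand = "-" if flag & 0x10 else "+"  # 0x10 = read on reverse strand
        groups[(fields[2], int(fields[3]), strand)].append(fields[0])
    # Keep only keys shared by more than one read.
    return {key: names for key, names in groups.items() if len(names) > 1}


dups = duplicate_groups(SAM_LINES)
print(dups)  # {('chr1', 100, '+'): ['readA', 'readB']}
```

readA and readB both pass the XT:A:U filter, yet they start at the same position on the same strand, so a duplicate-marking step would still be needed.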

2) Some people prefer to keep the duplicates in their BAM file. SNP callers do not consider reads marked as duplicates when calling variants, so keeping them in the BAM file does no harm. You can also remove them if you want to reduce the file size, but I keep them.
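Marking works through the SAM FLAG column: bit 0x400 (decimal 1024) means "PCR or optical duplicate". The sketch below, in plain Python with invented records, shows how a downstream step could skip marked reads without their having been deleted. The helper names are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_marked_duplicate(flag):
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_marked_duplicates(sam_lines):
    """Return header lines plus alignments whose duplicate bit is not set."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])
        if not is_marked_duplicate(flag):
            kept.append(line)
    return kept


# Toy input: one pair left unmarked (flags 99/147) and one pair marked as
# duplicate (same flags plus 1024). Names and coordinates are made up.
TOY_SAM = [
    "@HD\tVN:1.0\tSO:coordinate",
    "pair1\t99\tchr1\t100\t37\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII",
    "pair2\t1123\tchr1\t100\t37\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII",
    "pair1\t147\tchr1\t300\t37\t10M\t=\t100\t-210\tACGTACGTAC\tIIIIIIIIII",
    "pair2\t1171\tchr1\t300\t37\t10M\t=\t100\t-210\tACGTACGTAC\tIIIIIIIIII",
]

filtered = drop_marked_duplicates(TOY_SAM)
print(len(filtered))  # 3: the header plus both reads of pair1
```

Because the flag travels with the read, the choice between marking and removing can be postponed until a tool actually needs it.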

3) I would go with Picard. It is fast. See this post: samtools rmdup and picard mark duplicates


Thanks, Ashutoshmits, your answer makes sense to me

10.5 years ago

Picard has an option to remove duplicate reads instead of just marking them. Regarding unmapped reads, it is good to keep them, because tools may come along that use unmapped reads to infer interesting facts from your data. As for Picard vs. samtools rmdup, both are quite good at removing/marking duplicate reads (in paired-end data), but Picard can also remove interchromosomal duplicate reads, which samtools rmdup cannot.
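As a rough sketch, the two invocations could be assembled like this in Python. The file names and the `picard.jar` path are assumptions, and the helper functions are hypothetical. `REMOVE_DUPLICATES=true` is the Picard MarkDuplicates option that drops duplicates instead of only flagging them. The commands are only built and printed here, not executed.

```python
def picard_markduplicates_cmd(input_bam, output_bam, metrics_file,
                              remove_duplicates=False, picard_jar="picard.jar"):
    """Build a Picard MarkDuplicates command line as an argument list.

    Assumption: Picard is available as a single jar at `picard_jar`.
    """
    cmd = [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
    ]
    if remove_duplicates:
        # Drop duplicates from the output instead of only flagging them.
        cmd.append("REMOVE_DUPLICATES=true")
    return cmd


def samtools_rmdup_cmd(input_bam, output_bam):
    """Build a samtools rmdup command line (expects a coordinate-sorted BAM)."""
    return ["samtools", "rmdup", input_bam, output_bam]


picard_cmd = picard_markduplicates_cmd(
    "sorted.bam", "dedup.bam", "metrics.txt", remove_duplicates=True
)
rmdup_cmd = samtools_rmdup_cmd("sorted.bam", "rmdup.bam")

print(" ".join(picard_cmd))
print(" ".join(rmdup_cmd))
```

Either list could be passed to `subprocess.run(cmd, check=True)` once the tools and input files are actually in place.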

10.5 years ago
Tonyzeng ▴ 310

I gave up on the XT:A:U filtering step because it left reads without their mates (broken pairs).
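For anyone who hits the same problem, a small plain-Python sketch of what goes wrong. The records are invented and reduced to (read name, FLAG, XT tag), and the function names are hypothetical. Filtering read by read orphans a mate whenever only one end of a pair is unique; a pair-aware filter avoids that.

```python
from collections import Counter

# Toy paired-end records: (read name, SAM FLAG, XT tag). Values are made up.
RECORDS = [
    ("pair1", 99, "XT:A:U"),
    ("pair1", 147, "XT:A:U"),
    ("pair2", 99, "XT:A:U"),
    ("pair2", 147, "XT:A:R"),  # this mate maps to a repeat
]


def orphaned_after_filter(records):
    """Names of pairs that lose only some of their reads to a per-read filter."""
    total = Counter(name for name, _, _ in records)
    kept = Counter(name for name, _, xt in records if xt == "XT:A:U")
    return sorted(name for name in kept if kept[name] < total[name])


def pair_aware_filter(records):
    """Keep a pair only if every read in it carries XT:A:U."""
    bad = {name for name, _, xt in records if xt != "XT:A:U"}
    return [rec for rec in records if rec[0] not in bad]


orphans = orphaned_after_filter(RECORDS)
kept_pairs = pair_aware_filter(RECORDS)
print(orphans)          # ['pair2']
print(len(kept_pairs))  # 2: both reads of pair1
```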
