Question: Removing Duplicates by Filtering on XT:A:U, samtools rmdup, and Picard MarkDuplicates
Tonyzeng wrote (6.5 years ago):

Hi, as a beginner I am a little confused about duplicate filtering with different tools, so I have a few questions.

  1. Most of my SAM records have an optional field (TAG:TYPE:VALUE) in the 12th column. My understanding is that XT:A:U means the read mapped uniquely to the reference. So if I keep only the XT:A:U reads, have I already removed all potential duplicates, so that I no longer need samtools rmdup?

  2. Picard MarkDuplicates only marks (flags) the reads that are likely duplicates. Why just mark them rather than delete all duplicates?

  3. Assuming both Picard and samtools can remove duplicates, which one is more reliable?

Thank you

picard samtools markduplicates • 4.4k views

I forgot to ask one more question: when I filter on XT:A:U, there is a risk that one read of a paired-end pair is removed while its mate stays in the data, right?

Comment by Tonyzeng, 6.5 years ago

Ashutosh Pandey (Philadelphia) wrote (6.5 years ago):

1) A unique mapping means that the read could not be aligned to any other position in the genome with the same alignment score. In other words, no other alignment scores comparably to the best one. It does not mean that the read covers a region of the genome that no other read covers. You still need to mark duplicates even if you keep only XT:A:U (uniquely aligned) reads: if two reads are both uniquely aligned but have the same start position on the genome, one of them has to be filtered out.

2) Some people prefer to keep the duplicates in their BAM file. SNP callers generally ignore reads that are marked as duplicates when calling variants, so keeping them in your BAM file does no harm. You can remove them if you want to reduce the file size, but I keep them.

3) I would go with Picard. It is fast. See this post: "samtools rmdup and picard mark duplicates"
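
To illustrate point 1), here is a minimal Python sketch. It is not how samtools or Picard work internally. It keeps only records carrying XT:A:U and then groups them by reference name, start position, and strand; any group with more than one read is a candidate duplicate set, even though every read in it is uniquely mapped. The toy records and the helper names (`is_unique`, `duplicate_groups`) are made up for illustration, and real tools use more careful criteria (for example, mate positions in paired-end data).

```python
from collections import defaultdict

# Hypothetical toy SAM records (tab-separated): three unique reads, one repeat.
SAM_LINES = [
    "readA\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
    "readB\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
    "readC\t16\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U",
    "readD\t0\tchr1\t250\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R",
]


def is_unique(fields):
    """True if the optional fields (column 12 onward) contain XT:A:U."""
    return "XT:A:U" in fields[11:]


def duplicate_groups(sam_lines):
    """Group uniquely mapped reads by (reference, start, strand).

    Simplifying assumption: same start and strand means "duplicate".
    Returns only the groups that contain more than one read name.
    """
    groups = defaultdict(list)
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if not is_unique(fields):
            continue
        flag = int(fields[1])
        strand = "-" if flag & 0x10 else "+"  # 0x10 = read on reverse strand
        groups[(fields[2], int(fields[3]), strand)].append(fields[0])
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    print(duplicate_groups(SAM_LINES))
```

Here readA and readB both pass the XT:A:U filter and still land in the same group, which is why the uniqueness filter cannot replace a duplicate-marking step.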

Thanks, Ashutoshmits, your answer makes sense to me

Comment by Tonyzeng, 6.5 years ago

geek_y (Barcelona) wrote (6.5 years ago):

Picard has an option to remove duplicate reads instead of just marking them. As for unmapped reads, it is good to keep them, because tools may come along that use unmapped reads to infer something interesting from your data. As for Picard vs. samtools rmdup, both are quite good at removing/marking duplicate reads (in paired-end data), but Picard can also remove interchromosomal duplicates (pairs whose mates map to different chromosomes), which samtools rmdup cannot.
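
The difference between marking and removing can be sketched in a few lines of Python. Marked duplicates carry the SAM FLAG bit 0x400 (1024), so "removing" amounts to dropping records with that bit set. The helper names below are hypothetical, and `picard.jar` is an assumed path to your Picard jar. The command builder uses Picard's classic `KEY=value` argument style with the `REMOVE_DUPLICATES` option, and only returns the argument list without running anything.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit 1024: PCR or optical duplicate


def is_marked_duplicate(sam_line):
    """True if the record's FLAG has the duplicate bit set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE_FLAG)


def drop_marked_duplicates(sam_lines):
    """Keep header lines (starting with '@') and all non-duplicate records."""
    return [
        line for line in sam_lines
        if line.startswith("@") or not is_marked_duplicate(line)
    ]


def picard_markduplicates_cmd(in_bam, out_bam, metrics,
                              remove=False, jar="picard.jar"):
    """Build (but do not run) a Picard MarkDuplicates command line.

    `jar` is an assumed location; adjust it to your installation.
    """
    cmd = ["java", "-jar", jar, "MarkDuplicates",
           f"I={in_bam}", f"O={out_bam}", f"M={metrics}"]
    if remove:
        # Drop duplicates from the output instead of only flagging them.
        cmd.append("REMOVE_DUPLICATES=true")
    return cmd


if __name__ == "__main__":
    print(" ".join(picard_markduplicates_cmd(
        "in.bam", "dedup.bam", "metrics.txt", remove=True)))
```

Marking first and filtering later keeps the choice reversible: the original reads stay in the file, and tools that respect the flag simply skip them.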

Tonyzeng wrote (6.5 years ago):

I gave up on the XT:A:U filtering step because it left unpaired (orphaned) reads in my paired-end data.
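
A small Python sketch of how this orphaning can be checked, under the assumption that each mate appears as exactly one record (no secondary or supplementary alignments). It counts, per read name, how many paired records would survive an XT:A:U filter and reports the names where only one mate is left. The function name `orphaned_mates` and the toy records are hypothetical.

```python
from collections import Counter

# Hypothetical paired-end records: p1 has two unique mates, p2 has one
# unique mate (XT:A:U) and one repeat-mapped mate (XT:A:R).
PAIRED_SAM = [
    "@SQ\tSN:chr1\tLN:1000",
    "p1\t99\tchr1\t100\t37\t4M\t=\t200\t104\tACGT\tIIII\tXT:A:U",
    "p1\t147\tchr1\t200\t37\t4M\t=\t100\t-104\tACGT\tIIII\tXT:A:U",
    "p2\t99\tchr1\t300\t37\t4M\t=\t400\t104\tACGT\tIIII\tXT:A:U",
    "p2\t147\tchr1\t400\t0\t4M\t=\t300\t-104\tACGT\tIIII\tXT:A:R",
]


def orphaned_mates(sam_lines):
    """Return read names that keep only one mate after an XT:A:U filter.

    Assumes one record per mate; header lines (starting with '@') are skipped.
    """
    kept = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        is_paired = bool(flag & 0x1)  # 0x1 = read is part of a pair
        if is_paired and "XT:A:U" in fields[11:]:
            kept[fields[0]] += 1
    return sorted(name for name, count in kept.items() if count == 1)


if __name__ == "__main__":
    print(orphaned_mates(PAIRED_SAM))
```

If the filter is still wanted, one option is to keep a pair only when both mates pass, so that downstream paired-end tools never see a read without its mate.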
