Question

How does samtools markdup works?

4

Entering edit mode

4.3 years ago

s.singh ▴ 70

Hi,

I am using samtools markdup -r parameter to remove PCR duplicates from my mapped ChIP reads. How does it detects the PCR duplicated reads?

Please let me know if I am understanding it correctly. This is what I understood:

I need to name sort the mapped reads and put MS-MC tags (using fixmate) and then I need to coordinate sort the reads before using the "markdup -r" command. If the chromosome number and 5' end of the read are the same, it will treat them as duplicate and by looking at the MC and MS scores it will choose the best-read among the duplicates and delete the rest of the reads (I am using "-r" parameter).. Is this what is happening here?

Any kind of help will be highly appreciated.

Thanks

-S

samtools markdup • 12k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 4.3 years ago by s.singh ▴ 70

5

Entering edit mode

4.3 years ago

chaudharyc61 ▴ 90

To use samtools markdup there's a set of commands I believe you should use. These are the commands, I used for my data.

samtools sort -n -o Sorted_names.bam -O BAM Alignment.bam

Explanation: This command will sort your Alignment file based on the names of reads.

samtools fixmate -m Sorted_names.bam Fixmate.bam

Explanation: this will fill in mate coordinates and insert size fields

samtools sort -o Sorted.bam Fixmate.bam

Explanation: This will sort based on chromosome number and coordinates

samtools markdup -r -s Sorted.bam Final_File.bam

Explanation: This will remove all the duplicates and also print some basic stats about the result file

ADD COMMENT • link updated 18 months ago by Ram 43k • written 4.3 years ago by chaudharyc61 ▴ 90

2

Entering edit mode

4.3 years ago

ATpoint 81k

The sorting/marking/removing procedure is for paired-end data, as documented in the manual.

For single-end data you can directly feed in your coordinate-sorted BAM file. I quickly tested this on a dummy BAM file where I simply used the first read twice to make a duplicate. It is coordinate-sorted:

enter image description here

ADD COMMENT • link 4.3 years ago by ATpoint 81k

score 9 · Accepted Answer · 2020-01-10

Samtools markdup is written to match Picard 2.10.3 (also Biobambam's bamstreamingmarkduplicates) so if you can find documentation on those then it should also apply to Samtools.

It may seem like a complex dance to have both name and position sorted requirements, but this is perhaps due to a traditional view and attempting to do markdup as an after thought.

The design leverages the fact that the data produced by the sequencing instrument and subsequent aligner will be name collated (not sorted, but collated is all it actually needs - we should probably fix the man page). So while it is in this order we can add additional tags regarding the other ends of reads in an efficient manner. Then after positional sorting markdup has sufficient information to make the assessment which it can now also do in an efficient manner.

When viewed in that perspective, marking duplicates actually has zero additional sort steps than we're already using in our standard pipelines. Hopefully that helps to explain the somewhat unusual approach.

As for the exact maths, I don't know all the intricacies, but I think it's pretty much as you describe. There is more information here:

https://www.htslib.org/doc/samtools-markdup.1.html

https://www.htslib.org/algorithms/duplicate.html