How does samtools markdup works?
3
3
Entering edit mode
2.0 years ago
s.singh ▴ 50

Hi,

I am using samtools markdup -r parameter to remove PCR duplicates from my mapped ChIP reads. How does it detects the PCR duplicated reads?

Please let me know if I am understanding it correctly. This is what I understood:

I need to name sort the mapped reads and put MS-MC tags (using fixmate) and then I need to coordinate sort the reads before using the "markdup -r" command. If the chromosome number and 5' end of the read are the same, it will treat them as duplicate and by looking at the MC and MS scores it will choose the best-read among the duplicates and delete the rest of the reads (I am using "-r" parameter).. Is this what is happening here?

Any kind of help will be highly appreciated.

Thanks

-S

samtools markdup • 4.8k views
ADD COMMENT
7
Entering edit mode
2.0 years ago
jkbonfield ▴ 730

Samtools markdup is written to match Picard 2.10.3 (also Biobambam's bamstreamingmarkduplicates) so if you can find documentation on those then it should also apply to Samtools.

It may seem like a complex dance to have both name and position sorted requirements, but this is perhaps due to a traditional view and attempting to do markdup as an after thought.

The design leverages the fact that the data produced by the sequencing instrument and subsequent aligner will be name collated (not sorted, but collated is all it actually needs - we should probably fix the man page). So while it is in this order we can add additional tags regarding the other ends of reads in an efficient manner. Then after positional sorting markdup has sufficient information to make the assessment which it can now also do in an efficient manner.

When viewed in that perspective, marking duplicates actually has zero additional sort steps than we're already using in our standard pipelines. Hopefully that helps to explain the somewhat unusual approach.

As for the exact maths, I don't know all the intricacies, but I think it's pretty much as you describe. There is more information here:

https://www.htslib.org/doc/samtools-markdup.1.html

https://www.htslib.org/algorithms/duplicate.html

ADD COMMENT
0
Entering edit mode

Thanks a lot for the clarification. This helps a lot!

ADD REPLY
3
Entering edit mode
2.0 years ago
chaudharyc61 ▴ 60

To use samtools markdup there's a set of commands i believe you should use. These are the commands, I used for my data.

samtools sort -n -o Sorted_names.bam -O BAM Alignment.bam

explanation: This command will sort your Alignment file based on the names of reads.

samtools fixmate -m Sorted_names.bam Fixmate.bam

explanation: this will fill in mate coordinates and insert size fields

samtools sort -o Sorted.bam Fixmate.bam

explanation: This will sort based on chromosome number and coordinates

samtools markdup -r -s Sorted.bam Final_File.bam

explanation: This will remove all the duplicates and also print some basic stats about the result file

ADD COMMENT
2
Entering edit mode

Or in a piped fashion to avoid intermediate files directly after alignment:

alignment (options...) \
| samtools fixmate -m - - \
| samtools sort -O BAM \
| tee out_sorted_withDuplicates.bam \
| samtools markdup -r - out_sorted_Markdup.bam
ADD REPLY
2
Entering edit mode
2.0 years ago
ATpoint 57k

The sorting/marking/removing procedure is for paired-end data, as documented in the manual.

For single-end data you can directly feed in your coordinate-sorted BAM file. I quickly tested this on a dummy BAM file where I simply used the first read twice to make a duplicate. It is coordinate-sorted:

enter image description here

ADD COMMENT

Login before adding your answer.

Traffic: 2558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6