Question

How to properly pass from samtools rmdup to samtools markdup

0

Entering edit mode

20 months ago

raggi.emanuele • 0

I am using for the first time this week Whippet.jl

A friend passed me an old pipeline and he used samtools rmdup. However I know from the documentation that command is obsolete and markdup should be used instead.

The old command was: samtools rmdup -S filename.sort.bam filename.sort.rmdup.bam

I am trying to pass to the new notation and I think that this could be the equivalent: samtools markdup -S filename.sort.bam filename.sort.markdup.bam

I need to compare the results but I wanted to make sure that those commands do the same thing. The file are enormous and before potentially freezing the process I wanted to make sure about that.

I'm checking the manual, it looks like markdup by default removes PCR duplicates and not optical duplicates, I think that's what rmdup does too.

Thanks for providing some guidance on this.

samtools • 1.7k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 20 months ago by raggi.emanuele • 0

score 1 · Answer 1 · 2022-08-18

Rmdup removes reads, so you'd need markdup -r to do the same thing. However that's not ideal way of working anyway. It's better to leave all the data intact and just filter on-the-fly while processing, as then you can always go back from BAM to FASTQ still meaning you don't need to keep a separate FASTQ copy.

The rmdup -S option doesn't seem to have an equivalent that I can see, but I'm not sure what it was for anyway. If you have pair information, why not use it to get a more accurate determination of duplicates?

Finally, markdup man page does imply optical duplicates aren't processed by default ("Default is 0 to not look for optical duplicates"), but it's easy enough to enable it as described there. I've no idea what rmdup did, but there's no mention of "optical" anywhere in its source code so I think you're correct in that it didn't cover this case.

Note markdup runs on a coordinate sorted file, that has previously been through samtools fixmate -m while name collated. I don't know what sort order rmdup needs as it claims to want "sorted" without actually defining if it's name or position! I assume position sorted though?

Note the strategy of the two halves of markdup (fixmate + mark) in different sort orders is a way to improve performance. The aligner output will be name collated, so we can cache some data about supplementaries, secondaries, quality values for mate pairs, etc at that stage while all the data is collated together. After the normal position sort step has taken place, we can then rapidly determine the duplicates by looking at the alignment positions coupled with the earlier cached data. This avoids internal sorting steps used by e.g. elPrep, meaning both the fixmate and markdup phases are streaming operations.