How to properly pass from samtools rmdup to samtools markdup
1
0
Entering edit mode
6 weeks ago

I am using for the first time this week Whippet.jl

A friend passed me an old pipeline and he used samtools rmdup. However I know from the documentation that command is obsolete and markdup should be used instead.

The old command was: samtools rmdup -S filename.sort.bam filename.sort.rmdup.bam

I am trying to pass to the new notation and I think that this could be the equivalent: samtools markdup -S filename.sort.bam filename.sort.markdup.bam

I need to compare the results but I wanted to make sure that those commands do the same thing. The file are enormous and before potentially freezing the process I wanted to make sure about that.

I'm checking the manual, it looks like markdup by default removes PCR duplicates and not optical duplicates, I think that's what rmdup does too.

Thanks for providing some guidance on this.

rmdup markdup samtools • 263 views
1
Entering edit mode
6 weeks ago
jkbonfield ▴ 890

Rmdup removes reads, so you'd need markdup -r to do the same thing. However that's not ideal way of working anyway. It's better to leave all the data intact and just filter on-the-fly while processing, as then you can always go back from BAM to FASTQ still meaning you don't need to keep a separate FASTQ copy.

The rmdup -S option doesn't seem to have an equivalent that I can see, but I'm not sure what it was for anyway. If you have pair information, why not use it to get a more accurate determination of duplicates?

Finally, markdup man page does imply optical duplicates aren't processed by default ("Default is 0 to not look for optical duplicates"), but it's easy enough to enable it as described there. I've no idea what rmdup did, but there's no mention of "optical" anywhere in its source code so I think you're correct in that it didn't cover this case.

Note markdup runs on a coordinate sorted file, that has previously been through samtools fixmate -m while name collated. I don't know what sort order rmdup needs as it claims to want "sorted" without actually defining if it's name or position! I assume position sorted though?

Note the strategy of the two halves of markdup (fixmate + mark) in different sort orders is a way to improve performance. The aligner output will be name collated, so we can cache some data about supplementaries, secondaries, quality values for mate pairs, etc at that stage while all the data is collated together. After the normal position sort step has taken place, we can then rapidly determine the duplicates by looking at the alignment positions coupled with the earlier cached data. This avoids internal sorting steps used by e.g. elPrep, meaning both the fixmate and markdup phases are streaming operations.

1
Entering edit mode

Just a note about optical duplicates. In markdup (and rmdup) optical duplicates are a subset of duplicates. The optical duplicate -d option in markdup labels duplicates as either PCR or optical in the auxiliary tags, but it does not add any more duplicates. There will be the same number of duplicates with or without the optical option being set.

So if you are committed to removing the duplicates entirely then I would not bother setting the optical option. All it does is label a set of duplicates that you are subsequently removing from the file and will never see. If you do not set -d then the program will run faster (as it is doing less work).