Why am I getting the exact same MarkDuplicates results when passing a query or coordinate sorted bam?
0
0
Entering edit mode
3 months ago
curious ▴ 890

I am following a standard protocol for alignment, which mentions:

If a primary alignment is marked as duplicate, then all supplementary alignments for that same read should also be marked as duplicates ... For Picard, you must use >= version 2.4.1 and run on a queryname sorted input file.

Also from the picard MarkDuplicates docs:

When the input is coordinate-sorted... supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then ...secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

Thus I would expect that there would be a difference in output from MarkDuplicates when either coordinate or queryname sorted crams are used as input, but my pipeline is giving the exact same results regardless. What could be causing this?

here are my commands:

alignment

    bwa mem \
    -K 100000000 \
    -Y \
    -R "@RG\\tID:{FC}.{LN}\\tSM:{SM}\\tPL:ILLUMINA\\tPU:{FC}.{LN}.{SAMPLE_BARCODE}\\tLB:{LB}\\tCN:{CN}" \
    {ref} \
    {read1} \
    {read2} | \
    samtools view  -C --reference {ref} -o {cram} -

sort by coordinate

samtools sort  -O CRAM --reference {ref} -o {coord_sort_cram} {cram}

sort by queryname

samtools sort -n -O CRAM --reference {ref} -o {query_sort_cram}  {cram}

mark dups of coordinate sorted

    MarkDuplicates \
    I={coord_sort_cram} \
    O={coord_sort_cram.mark_dup_cram} \
    M={coord_sort_cram.dup_metrics} \
    R={ref} 

make dups of query sorted

   MarkDuplicates \
    I={query_sort_cram} \
    O={query_sort_cram.mark_dup_cram} \
    M={query_sort_cram.dup_metrics} \
    ASSUME_SORT_ORDER=queryname \
    R={ref}

Why are these the same?

{coord_sort_cram.dup_metrics}
{query_sort_cram.dup_metrics}
picard bwa • 848 views
ADD COMMENT
0
Entering edit mode

Can you provide the actual metrics output?

What could be causing this?

Perhaps your data has no secondary/supplementary alignments.

Also have you seen this: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats

All GATK tools that take in mapped read data expect a BAM file as primary format. Some support the CRAM format, but we have observed performance issues when working directly from CRAM files, so in our own work we convert CRAM to BAM first, and we only use CRAM for archival purposes.

ADD REPLY
0
Entering edit mode

There are no secondary alignments as per :

samtools -f  0x100 {my_bam}

I only get secondary alignments if I add the -a flag to bwa mem.

There are supplementary as per:

samtools -f 0x800 {my_bam}

This is the head of the dup metrics file, which is identical for coordinate or queryname sorted. I am just using a sample of my data here to test my pipline, but with 7070 supplementary reads I would have expected a difference between the two sorting approaches:

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=my_input.bam OUTPUT=my_output.bam METRICS_FILE=sample.MarkDuplicates.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY FLOW_USE_END_IN_UNPAIRED_READS=false FLOW_USE_UNPAIRED_CLIPPED_END=false FLOW_UNPAIRED_END_UNCERTAINTY=0 FLOW_UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri May 30 16:11:41 EDT 2025

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Sample1 17713   1982104 54582   18079   7070    813650  809733  0.410448    175056242

Also thanks for the BAM vs CRAM warning that makes a big difference in how I design my pipeline. I have rerun things using the BAM format and see the same issue of no dup marking difference between coordinate and query sorted files.

ADD REPLY
0
Entering edit mode

I think I figured this out (maybe). It looks like MarkDuplicates is marking supplementary reads as duplicates in the query-sorted CRAM, but not in the coordinate-sorted one even though the metrics report the exact same number of duplicates for both.

To dig into this, I extracted supplementary reads using:

samtools view -f 0x800 {query_sort_cram.mark_dup_cram} | cut -f 1 | sort | uniq > supp_reads.txt

Then I compared the relevant records between the two BAMS:

 diff <(samtools view -N supp_reads.txt  {query_sort_cram.mark_dup_bam} | sort ) <(samtools view -N supp_reads.txt  {coord_sort_cram.mark_dup_bam}  | sort) | less -S

I used the explain SAM flags to spot-check a few differences at the top of the diff. In every case I looked at, the primary reads were treated the same across both files but supplementary reads were marked as duplicates only in the query-sorted BAM.

I’m not sure why the duplicate metrics are identical. Maybe MarkDuplicates treats supplementary duplicates as part of the same duplicate "event" as their primary and doesn’t count them separately in the summary?

ADD REPLY
0
Entering edit mode

Maybe MarkDuplicates treats supplementary duplicates as part of the same duplicate "event" as their primary and doesn’t count them separately in the summary?

Since the metrics category title says SECONDARY_**OR**_SUPPLEMENTARY_RDS perhaps they are only counted once?

ADD REPLY

Login before adding your answer.

Traffic: 5034 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6