I am following a standard protocol for alignment, which mentions:
If a primary alignment is marked as duplicate, then all supplementary alignments for that same read should also be marked as duplicates ... For Picard, you must use >= version 2.4.1 and run on a queryname sorted input file.
Also from the picard MarkDuplicates docs:
When the input is coordinate-sorted... supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), then ...secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.
Thus I would expect that there would be a difference in output from MarkDuplicates when either coordinate or queryname sorted crams are used as input, but my pipeline is giving the exact same results regardless. What could be causing this?
here are my commands:
alignment
bwa mem \
-K 100000000 \
-Y \
-R "@RG\\tID:{FC}.{LN}\\tSM:{SM}\\tPL:ILLUMINA\\tPU:{FC}.{LN}.{SAMPLE_BARCODE}\\tLB:{LB}\\tCN:{CN}" \
{ref} \
{read1} \
{read2} | \
samtools view -C --reference {ref} -o {cram} -
sort by coordinate
samtools sort -O CRAM --reference {ref} -o {coord_sort_cram} {cram}
sort by queryname
samtools sort -n -O CRAM --reference {ref} -o {query_sort_cram} {cram}
mark dups of coordinate sorted
MarkDuplicates \
I={coord_sort_cram} \
O={coord_sort_cram.mark_dup_cram} \
M={coord_sort_cram.dup_metrics} \
R={ref}
make dups of query sorted
MarkDuplicates \
I={query_sort_cram} \
O={query_sort_cram.mark_dup_cram} \
M={query_sort_cram.dup_metrics} \
ASSUME_SORT_ORDER=queryname \
R={ref}
Why are these the same?
{coord_sort_cram.dup_metrics}
{query_sort_cram.dup_metrics}
Can you provide the actual metrics output?
Perhaps your data has no secondary/supplementary alignments.
Also have you seen this: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
There are no secondary alignments as per :
I only get secondary alignments if I add the
-a
flag to bwa mem.There are supplementary as per:
This is the head of the dup metrics file, which is identical for coordinate or queryname sorted. I am just using a sample of my data here to test my pipline, but with 7070 supplementary reads I would have expected a difference between the two sorting approaches:
Also thanks for the BAM vs CRAM warning that makes a big difference in how I design my pipeline. I have rerun things using the BAM format and see the same issue of no dup marking difference between coordinate and query sorted files.
I think I figured this out (maybe). It looks like MarkDuplicates is marking supplementary reads as duplicates in the query-sorted CRAM, but not in the coordinate-sorted one even though the metrics report the exact same number of duplicates for both.
To dig into this, I extracted supplementary reads using:
Then I compared the relevant records between the two BAMS:
I used the explain SAM flags to spot-check a few differences at the top of the diff. In every case I looked at, the primary reads were treated the same across both files but supplementary reads were marked as duplicates only in the query-sorted BAM.
I’m not sure why the duplicate metrics are identical. Maybe MarkDuplicates treats supplementary duplicates as part of the same duplicate "event" as their primary and doesn’t count them separately in the summary?
Since the metrics category title says
SECONDARY_**OR**_SUPPLEMENTARY_RDS
perhaps they are only counted once?