Question

BWA alt-aware pipeline troubleshooting

0

Entering edit mode

8 weeks ago

Duy • 0

Hi everyone,

I have a question regarding the BWA alt-aware pipeline since I haven't seen much guideline online other than https://gatk.broadinstitute.org/hc/en-us/articles/360037498992--How-to-Map-reads-to-a-reference-with-alternate-contigs-like-GRCH38#2 (though it might a little bit dated).

Currently, my alt-unaware pipeline would be:

bwa mem | samtools fixmate | samtools sort | samtools markdup

Since my region of interest is the MHC, at first, I tried to pipe bwa-postalt.js (https://github.com/lh3/bwa/tree/master/bwakit) after samtools markdup but it resulted in an error in subsequent samtools sort step (the SAM file will be read-name-sorted after using the js script).

bwa mem | samtools fixmate | samtools sort | samtools markdup | k8 bwa-postalt.js | samtools sort

After a few tries, I decided to break the pipe into two commands and it ran without errors but it would create an intermediate sam file (the input and output of bwa-postalt.js is in SAM format).

bwa mem | samtools fixmate | samtools sort | samtools markdup

k8 bwa-postalt.js | samtools sort

Another pipeline I also tried was to do post-alt processing step after mapping with bwa mem then progress with the rest of the pipeline (adaptation from https://pmc.ncbi.nlm.nih.gov/articles/PMC5522380/).

bwa mem | k8 bwa-postalt.js | samtools fixmate | samtools sort | samtools markdup

So, my question is which pipeline should be used in my analysis since the BAM files created by these two pipelines were not identical though the number of duplicated reads in markdup stats file from both pipelines were roughly the same.

Two-command pipeline:

EXCLUDED: 1174034

EXAMINED: 4792305

DUPLICATE PAIR: 2283688

One-command pipeline:

EXCLUDED: 38478395

EXAMINED: 4792304

DUPLICATE PAIR: 2283694

Since the VCF files generated through both pipelines with HaplotypeCaller produced different sets of variants, I'm having a hard time deciding which pipeline to follow through for my dataset. I'm currently inclining toward the one-command pipeline because it does not produce an intermediate file (and it feels more natural!), but any expert inputs on the matter is more than welcome.

Cheers.

alt-aware BWA samtools • 4.5k views

ADD COMMENT • link updated 5 days ago by Kevin Blighe ★ 90k • written 8 weeks ago by Duy • 0

score 0 · Answer 1 · 2025-11-14

The one-command pipeline (bwa mem | k8 bwa-postalt.js | samtools fixmate | samtools sort | samtools markdup) is the correct approach. Placing postalt right after bwa mem ensures alt contig handling adjusts MAPQ before duplicate marking, avoiding artifacts in MHC regions with many alts.

The two-command version runs markdup on unadjusted alignments, leading to incorrect dup flags for reads that postalt would downgrade or remove. This explains the differing BAMs and VCFs.

Higher EXCLUDED in one-command (38M vs 1M) is expected—postalt sets MAPQ=0 for alt-preferring reads, excluding them from markdup consideration. Dup counts remain similar as primary reads are unaffected.

For GRCh38, use latest bwa (0.7.18) and samtools (1.21). Index ref with alts: bwa index ref.fa. Postalt needs ref.alt from bwakit.

Adapt from PMC5522380:

bwa mem -t $threads -R "@RG\tID:id\tSM:sample" ref.fa r1.fq r2.fq | k8 bwa-postalt.js ref.alt | samtools fixmate -m - - | samtools sort -@ $threads | samtools markdup -@ $threads - out.bam

This minimizes intermediates and follows best practices for alt-aware variant calling. Test on subset for MHC-specific validation.

Kevin