Hi all,
I’m working on a human RNA-seq variant calling pipeline using GATK (v4.3), and I recently realized that I may have swapped two key steps in the preprocessing stage. Here's what I did:
1. Alignment with HISAT2
2. Conversion to a sorted BAM
3. SplitNCigarReads
4. MarkDuplicates (Picard)
5. BQSR, HaplotypeCaller, and filtering
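To make the two orderings concrete, here is a minimal Python sketch that only builds the two command lines for either order; it does not run anything. The helper name `preprocessing_commands` and all file names (`sorted.bam`, `ref.fa`, `step1.bam`, `step2.bam`, `dup_metrics.txt`) are placeholders made up for illustration, not my actual paths, and the arguments shown are just the basic input/output/reference/metrics ones as I understand them.

```python
from typing import List


def mark_duplicates(inp: str, out: str) -> List[str]:
    # Picard MarkDuplicates as bundled with GATK4; metrics file name is a placeholder.
    return ["gatk", "MarkDuplicates", "-I", inp, "-O", out, "-M", "dup_metrics.txt"]


def split_n_cigar(inp: str, out: str, ref: str) -> List[str]:
    # SplitNCigarReads needs the reference FASTA in addition to input/output BAMs.
    return ["gatk", "SplitNCigarReads", "-R", ref, "-I", inp, "-O", out]


def preprocessing_commands(bam: str, ref: str, dedup_first: bool) -> List[List[str]]:
    """Return the two preprocessing commands in the chosen order (nothing is executed)."""
    first_out, second_out = "step1.bam", "step2.bam"  # hypothetical intermediate names
    if dedup_first:
        # Order suggested by the tutorials: MarkDuplicates, then SplitNCigarReads.
        return [
            mark_duplicates(bam, first_out),
            split_n_cigar(first_out, second_out, ref),
        ]
    # Order I actually ran: SplitNCigarReads, then MarkDuplicates.
    return [
        split_n_cigar(bam, first_out, ref),
        mark_duplicates(first_out, second_out),
    ]


if __name__ == "__main__":
    for label, flag in (("what I ran", False), ("tutorial order", True)):
        print(label)
        for cmd in preprocessing_commands("sorted.bam", "ref.fa", dedup_first=flag):
            print("  " + " ".join(cmd))
```

In both cases the second tool reads the BAM written by the first; the only difference is which tool sees the original spliced alignments.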
However, I now see that several GATK tutorials and forum threads suggest running MarkDuplicates before SplitNCigarReads. I’m concerned that my current pipeline (with the order reversed) may lead to incorrect or biased variant calls.
Would this have a significant impact on the results (e.g., duplicate marking failing, false positives, or distorted coverage)?
Has anyone compared results from both orderings or found issues when SplitNCigarReads comes first?
Thanks in advance for your insights!