Question

Does the order of SplitNCigarReads and MarkDuplicates affect RNA-seq variant calling results?

12 weeks ago

iamsmor • 0

Hi all,

I’m working on a human RNA-seq variant calling pipeline using GATK (v4.3), and I recently realized that I may have swapped two key steps in the preprocessing stage. Here's what I did:

Alignment with HISAT2

Conversion to sorted BAM

Step 1: SplitNCigarReads

Step 2: MarkDuplicates (Picard)

Then followed with BQSR, HaplotypeCaller, and filtering

However, I now see that several GATK tutorials and forums suggest doing MarkDuplicates before SplitNCigarReads. I’m concerned whether my current pipeline (with the reverse order) may lead to incorrect or biased variant calls.

Would this have a significant impact on the results (e.g., duplicate marking failing, false positives, coverage distortion, etc.)?

Has anyone compared results from both orderings or found issues when SplitNCigarReads comes first?

Thanks in advance for your insights!

variantcalling. rnaseq gatk • 546 views

updated 12 weeks ago by rfran010 ★ 1.6k • written 12 weeks ago by iamsmor • 0

score 0 · Answer 1 · 2025-06-24

I have not compared results as you suggest, but logically, there is a functional difference. Whether this has a great effect depends on the nature of your data.

MarkDuplicates generally works by marking reads with the same sequence and start position, whereas SplitNCigarReads splits one read into multiple reads. In theory, this could affect duplicate marking. For example, if two reads start at different positions (so they are not duplicates), then after splitting, their downstream pieces may map to the same position with the same sequence and be marked as duplicates, even though they probably are not.

Rough example:

Before splitting (not duplicates)
readA: ----ATGCGNNNNNNNNNNNNNNATTCGCGGGC
readB: CTAGATGCGNNNNNNNNNNNNNNATTCGCGGGC

After splitting (readC and readD look like duplicates)
readA: ----ATGCG    readC: ATTCGCGGGC
readB: CTAGATGCG    readD: ATTCGCGGGC
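
The same toy case can be sketched in Python. This is only a simplified model of the reasoning above, not Picard's actual algorithm: the duplicate key here is just (start position, sequence), whereas the real MarkDuplicates also considers things like orientation and mate information, and real SplitNCigarReads output may be treated differently. The helper names (`segments`, `duplicate_groups`), the intron length of 14, and the `/1`, `/2` piece names are all made up for illustration; `readA/2` and `readB/2` correspond to readC and readD above.

```python
import re
from collections import defaultdict

# Arbitrary intron length for the toy example (assumption, any length works).
INTRON = "N" * 14

# Gapped toy alignments: '-' pads the left edge, 'N' is the skipped intron.
reads = {
    "readA": "----ATGCG" + INTRON + "ATTCGCGGGC",
    "readB": "CTAGATGCG" + INTRON + "ATTCGCGGGC",
}

def segments(aligned):
    """Return (start, sequence) for each run of bases, i.e. split at the Ns."""
    return [(m.start(), m.group()) for m in re.finditer(r"[ACGT]+", aligned)]

def duplicate_groups(records):
    """Group record names by a simplified key: (start position, sequence)."""
    groups = defaultdict(list)
    for name, start, seq in records:
        groups[(start, seq)].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Before splitting: one record per read, keyed on its first aligned base.
unsplit = [
    (name, segments(aln)[0][0], aln.lstrip("-"))
    for name, aln in reads.items()
]

# After splitting: one record per exon-aligned piece of each read.
split_records = [
    (f"{name}/{i}", start, seq)
    for name, aln in reads.items()
    for i, (start, seq) in enumerate(segments(aln), start=1)
]

print("duplicates before splitting:", duplicate_groups(unsplit))
print("duplicates after splitting: ", duplicate_groups(split_records))
```

Under this simplified key, the unsplit reads produce no duplicate group, while the second pieces of both reads collapse into one group after splitting. How much this matters on real data would depend on how many reads share a splice junction but differ in their start positions.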