Question

RNA variant calling, aligned with HiSat2

0

Entering edit mode

2.8 years ago

OhHiImNewHere ▴ 10

Trying to do my variant calling on RNA paired tumour-normal pairs with Mutect2.

Reads were aligned with Hisat2 and alignment is 97-98%.

I ran Mutect 2 and it gave me about 100 SNPs and 100 indels (across the whole exome). I validated these against known variants from previous targeted sequencing on the same subject.

I then went back and did some preprocessing- marking duplicates, splitNCigar and BQSR.

The SNPs are about the same- slightly fewer but probably just increased sensitivity- but I now have 5000 indels.

I have a feeling that SplitNCigar should not be used on Hisat2 aligned reads and this is the reason for these 'indels'.

Has anyone else found similar? Can I variant call using Hisat2 aligned reads, and just skip the SplitNCigar step?

hisat2 gatk splitncigar • 1.6k views

ADD COMMENT • link 2.8 years ago by OhHiImNewHere ▴ 10

score 2 · Answer 1 · 2021-07-19

2

Entering edit mode

2.8 years ago

benformatics 3.9k

The reason you have 5000 indels is probably because some spliced reads are now (for whatever reason) being considered indels. Since GATK is mainly used for DNA-seq, then it's not really a surprise.

The SplitNCigar step is needed for running BQSR but to what extent this is even helpful for RNA-seq is unclear.

You could be correct in thinking that HISAT2 alignments don't have the right flags/scores to be compatible with GATK's SplitNCigar, to verify this I would suggest re-aligning with STAR (2-pass mode) as suggested in the link from ATpoint. I do recall that bowtie had some issues with GATK so it could be that those also exist in Hisat2.

Either way I would be extremely skeptical of any indels from RNA-seq data... probably safest to focus on SNVs. If you are extremely interested in specific INDELs, I would make sure to go look at them in IGV and verify that they are real before moving forward.

ADD COMMENT • link 2.8 years ago by benformatics 3.9k

1

Entering edit mode

I far as I understand it, SplitNCigar was designed specifically to make data alignmed with a splice-aware aligner compatible with the GATK pipeline. Its clearly helpful for RNAseq because it takes a spliced read and turns it into two non-spliced reads.

ADD REPLY • link 2.8 years ago by i.sudbery 19k

0

Entering edit mode

Thanks- I think after some discussion that the best thing to do is just focus on SNVs.

ADD REPLY • link 2.8 years ago by OhHiImNewHere ▴ 10

score 1 · Answer 2 · 2021-07-19

As the process of variant calling in RNA-seq is even less standardized as compared to DNA-seq, and given that RNA-seq is inferior than DNA-seq for it, I would at least (and simply) follow the guidelines Broad has put out:

https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-

Not saying it is perfect (in fact I do not know) but at least it is a path to follow and a somewhat citable reference.