Hello,
I am sharing a modified script and processing steps that implement the workflow of the original RNA-MuTect pipeline (https://github.com/broadinstitute/RNA_MUTECT_1.0-1/tree/master) using a modern bioinformatics toolchain.
The original RNA-MuTect pipeline is a validated method for identifying somatic variants in RNA-Seq data. However, its reliance on GATK3, MuTect1, and the hg19 reference makes it difficult to implement in today's analysis environments.
This script automates the full pipeline, from raw RNA BAM to a final, re-aligned VCF, using GATK4 and HISAT2. It is designed to be a reproducible and user-friendly starting point for RNA-based somatic variant discovery.
GitHub Repository: https://github.com/seq2c/modern-rna-mutect
Key points
- Modes: tumor-only or matched-normal
- Parallelized SplitNCigarReads, Mutect2, and Funcotator across contigs.
- Faithful logic: extract site-overlapping reads -> HISAT2 re-align -> Mutect2 re-call on intervals.
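
As a rough illustration of the per-contig parallelization and the two modes listed above, here is a minimal Python sketch that only builds one `gatk Mutect2` command per contig. It is not taken from the repository: the helper name `mutect2_scatter_commands`, the file names, and the output layout are all hypothetical. The GATK4 arguments used (`-R`, `-I`, `-L`, `-O`, `-normal`) are standard Mutect2 options, but the actual script may pass additional ones.

```python
from typing import List, Optional


def mutect2_scatter_commands(
    reference: str,
    tumor_bam: str,
    contigs: List[str],
    out_dir: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
) -> List[List[str]]:
    """Build one `gatk Mutect2` command per contig (hypothetical helper).

    Tumor-only mode: leave normal_bam and normal_sample as None.
    Matched-normal mode: pass both the normal BAM and its sample name.
    """
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("normal_bam and normal_sample must be given together")

    commands = []
    for contig in contigs:
        cmd = [
            "gatk", "Mutect2",
            "-R", reference,
            "-I", tumor_bam,
            "-L", contig,  # restrict this job to a single contig
            "-O", f"{out_dir}/{contig}.vcf.gz",  # assumed output layout
        ]
        if normal_bam is not None:
            # Matched-normal mode: add the normal BAM and name its sample.
            cmd += ["-I", normal_bam, "-normal", normal_sample]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Placeholder inputs; nothing is executed, the commands are only printed.
    for cmd in mutect2_scatter_commands(
        "ref.fa", "tumor.split.bam", ["chr1", "chr2"], "scatter"
    ):
        print(" ".join(cmd))
```

Each command list could then be handed to `subprocess.run` from a worker pool, and the per-contig VCFs and stats files combined afterwards, for example with GATK4's `MergeVcfs` and `MergeMutectStats`. Whether the script does exactly this is an assumption on my part.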
The script outputs a VCF file and its associated stats file, which can be used as input for the further filtering steps outlined in the original RNA-MuTect paper. Rewriting the original MATLAB filtering code is on the to-do list, and it will be shared once completed. Further improvements are also planned, such as supporting BAMs from other aligners (for example, minimap2) and making the pipeline more adaptable to other reference genomes and callers.
Feedback, feature suggestions, and bug reports are welcome via the GitHub repository's issue tracker. I hope this proves useful to the community!