I have received some MuTect v1 text outputs from our collaborator. While I can probably write scripts that can convert the output into VCF, are there software out there that does this? R, Perl or Python are fine.
I have received some MuTect v1 text outputs from our collaborator. While I can probably write scripts that can convert the output into VCF, are there software out there that does this? R, Perl or Python are fine.
Since the comment above implied there isn't, eventually I wrote one in R that should provide all the fields that -vcf
emits; but since the raw MuTect text out doesn't give out the raw DP, the DP is derived as the sum of the two ADs.
This script is also based on the assumption that MuTect1 only emits 1 alt per position, and that it gives only SNVs.
It's too long to share here; I'll post it to gitHub in a later comment.
library (vcfR)
MT12VCF <- function (inputFile, outFile){
inputDF <- read.delim (inputFile, stringsAsFactors=F)
if ((length(unique(inputDF$normal_name)) > 1) ||
(length(unique(inputDF$tumor_name)) > 1)){
stop ("This function assumes there's only one normal-tumor pair in the input.")
}
MuTect1StdHeaders <- '##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Accept as a confident somatic mutation">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=BQ,Number=A,Type=Float,Description="Average base quality for reads supporting alleles">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FA,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal,0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional modification,5=unknown">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS or DEL">
##INFO=<ID=t_lod_fstar,Number=1,Type=Float,Description=" Log of (likelihood tumor event is real / likelihood event is sequencing error )">'
MuTect1StdHeadersVec <- unlist(strsplit(MuTect1StdHeaders, "\n", fixed=TRUE))
dbsnpField <- sapply(inputDF$dbsnp_site,
function(x) ifelse(x=="NOVEL", "", "DB;"))
somaticField <- sapply(inputDF$judgement,
function(x) ifelse(x=="KEEP", "SOMATIC;", ""))
filterField <- sapply(inputDF$judgement,
function(x) ifelse(x=="KEEP", "PASS", "REJECT"))
vcfFixDf <- data.frame (
CHROM = inputDF$contig,
POS = inputDF$position,
ID = NA_character_,
REF = inputDF$ref_allele,
ALT = inputDF$alt_allele,
QUAL = inputDF$score,
FILTER = filterField,
INFO = paste(sep="",
"MQ0=",inputDF$map_q0_reads, ";",
dbsnpField, somaticField,
"t_lod_fstar=", inputDF$t_lod_fstar,";",
"VT=SNP"),
stringsAsFactors = FALSE
)
vcfGTdf <- data.frame (
FORMAT = "GT:AD:BQ:DP:FA",
NORMAL = paste (0,
paste (inputDF$n_ref_count,inputDF$n_alt_count, sep=","),
".",
inputDF$n_ref_count+inputDF$n_alt_count,
inputDF$n_alt_count /
(inputDF$n_ref_count+inputDF$n_alt_count),
sep=":"
),
TUMOR = paste ('0/1',
paste (inputDF$t_ref_count,inputDF$t_alt_count, sep=","),
paste (inputDF$t_ref_sum / inputDF$t_ref_count,
inputDF$t_alt_sum / inputDF$t_alt_count, sep=","),
inputDF$t_ref_count+inputDF$t_alt_count,
inputDF$t_alt_count /
(inputDF$t_ref_count+inputDF$t_alt_count),
sep=":"
),
stringsAsFactors = F
)
vcfGTdf[inputDF$judgement=="PASS", "FORMAT"] <- paste (
vcfGTdf[inputDF$judgement=="PASS", "FORMAT"], ":SS", sep=""
)
vcfGTdf[inputDF$judgement=="PASS", "NORMAL"] <- paste (
vcfGTdf[inputDF$judgement=="PASS", "NORMAL"], ":0", sep=""
)
vcfGTdf[inputDF$judgement=="PASS", "TUMOR"] <- paste (
vcfGTdf[inputDF$judgement=="PASS", "TUMOR"], ":2", sep=""
)
rownames(vcfGTdf)[c(2,3)] <- inputDF[1, c("normal_name", "tumor_name")]
vcfRObj <- new(Class = "vcfR")
vcfRObj@meta <- MuTect1StdHeadersVec
vcfRObj@fix <- as.matrix (vcfFixDf)
vcfRObj@gt <- as.matrix(vcfGTdf)
write.vcf (vcfRObj, outFile)
}
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I remember that I saw a similar question in the GATK community a while ago, and the guys there said that there is no tool for post-hoc conversion. They suggested, as most time-saving workaround, to extract the genomic regions that are mutated from the file, and re-run MuTect on only these regions with the -vcf option. Maybe you can get your collaborators to do that for you.