Question

Converting MuTect v1 text outputs to VCF

0

Entering edit mode

6.9 years ago

John Ma ▴ 310

I have received some MuTect v1 text outputs from our collaborator. While I can probably write scripts that can convert the output into VCF, are there software out there that does this? R, Perl or Python are fine.

snp MuTect VCF • 2.6k views

ADD COMMENT • link 6.9 years ago by John Ma ▴ 310

0

Entering edit mode

I remember that I saw a similar question in the GATK community a while ago, and the guys there said that there is no tool for post-hoc conversion. They suggested, as most time-saving workaround, to extract the genomic regions that are mutated from the file, and re-run MuTect on only these regions with the -vcf option. Maybe you can get your collaborators to do that for you.

ADD REPLY • link 6.9 years ago by ATpoint 82k

score 1 · Answer 1 · 2017-06-15

1

Entering edit mode

6.9 years ago

John Ma ▴ 310

Since the comment above implied there isn't, eventually I wrote one in R that should provide all the fields that -vcf emits; but since the raw MuTect text out doesn't give out the raw DP, the DP is derived as the sum of the two ADs.

This script is also based on the assumption that MuTect1 only emits 1 alt per position, and that it gives only SNVs.

It's too long to share here; I'll post it to gitHub in a later comment.

ADD COMMENT • link 6.9 years ago by John Ma ▴ 310

0

Entering edit mode

library (vcfR)

MT12VCF <- function (inputFile, outFile){

  inputDF <- read.delim (inputFile, stringsAsFactors=F)

  if ((length(unique(inputDF$normal_name)) > 1) ||
      (length(unique(inputDF$tumor_name)) > 1)){
    stop ("This function assumes there's only one normal-tumor pair in the input.")
  }

  MuTect1StdHeaders <- '##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="Accept as a confident somatic mutation">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=BQ,Number=A,Type=Float,Description="Average base quality for reads supporting alleles">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FA,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal,0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional modification,5=unknown">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS or DEL">
##INFO=<ID=t_lod_fstar,Number=1,Type=Float,Description=" Log of (likelihood tumor event is real / likelihood event is sequencing error )">'

  MuTect1StdHeadersVec <- unlist(strsplit(MuTect1StdHeaders, "\n", fixed=TRUE))

  dbsnpField <- sapply(inputDF$dbsnp_site, 
                       function(x) ifelse(x=="NOVEL", "", "DB;"))
  somaticField <- sapply(inputDF$judgement, 
                         function(x) ifelse(x=="KEEP", "SOMATIC;", ""))
  filterField <- sapply(inputDF$judgement, 
                        function(x) ifelse(x=="KEEP", "PASS", "REJECT"))

  vcfFixDf <- data.frame (
    CHROM = inputDF$contig,
    POS = inputDF$position,
    ID = NA_character_,
    REF = inputDF$ref_allele,
    ALT = inputDF$alt_allele,
    QUAL = inputDF$score,
    FILTER = filterField,
    INFO = paste(sep="",
                 "MQ0=",inputDF$map_q0_reads, ";",
                 dbsnpField, somaticField,
                 "t_lod_fstar=", inputDF$t_lod_fstar,";",
                 "VT=SNP"),
    stringsAsFactors = FALSE
  )

  vcfGTdf <- data.frame (
    FORMAT = "GT:AD:BQ:DP:FA",
    NORMAL = paste (0, 
                    paste (inputDF$n_ref_count,inputDF$n_alt_count, sep=","),
                    ".",
                    inputDF$n_ref_count+inputDF$n_alt_count,
                    inputDF$n_alt_count / 
                      (inputDF$n_ref_count+inputDF$n_alt_count),
                    sep=":"
    ),
    TUMOR = paste ('0/1', 
                   paste (inputDF$t_ref_count,inputDF$t_alt_count, sep=","),
                   paste (inputDF$t_ref_sum / inputDF$t_ref_count,
                          inputDF$t_alt_sum / inputDF$t_alt_count, sep=","),
                   inputDF$t_ref_count+inputDF$t_alt_count,
                   inputDF$t_alt_count / 
                     (inputDF$t_ref_count+inputDF$t_alt_count),
                   sep=":"
    ),
    stringsAsFactors = F
  )

  vcfGTdf[inputDF$judgement=="PASS", "FORMAT"] <- paste (
    vcfGTdf[inputDF$judgement=="PASS", "FORMAT"], ":SS", sep=""
  )

  vcfGTdf[inputDF$judgement=="PASS", "NORMAL"] <- paste (
    vcfGTdf[inputDF$judgement=="PASS", "NORMAL"], ":0", sep=""
  )

  vcfGTdf[inputDF$judgement=="PASS", "TUMOR"] <- paste (
    vcfGTdf[inputDF$judgement=="PASS", "TUMOR"], ":2", sep=""
  )

  rownames(vcfGTdf)[c(2,3)] <- inputDF[1, c("normal_name", "tumor_name")]

  vcfRObj <- new(Class = "vcfR")
  vcfRObj@meta <- MuTect1StdHeadersVec
  vcfRObj@fix <- as.matrix (vcfFixDf)
  vcfRObj@gt <- as.matrix(vcfGTdf)

  write.vcf (vcfRObj, outFile)

}

ADD REPLY • link 6.9 years ago by John Ma ▴ 310