How to format "I" and "D" alleles in a VCF 4.2 file for liftover with GATK
5 weeks ago

Hello everyone

I am facing challenges with liftover of a VCF file from hg19 to hg38 using GATK because of 'I' and 'D' annotations representing insertions and deletions in the VCF file.

This is the command I ran for the liftover:

gatk LiftoverVcf -I SNP_GRCh37.vcf  -O Liftover_with_Indels/lifted_over.vcf -C hg19ToHg38.over.chain.gz -WMC true  -R genome.fa --REJECT Liftover_with_Indels/rejeceted_variants.vcf --RECOVER_SWAPPED_REF_ALT True

Despite having converted the file to VCF version 4.2 using vcftools, I still get this error:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 200: Insertions/Deletions are not supported when reading 3.x VCF's. Please convert your file to VCF4 using 
VCFTools, available at, for input source: file:///SNP_GRCh37.vcf
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(
    at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(
    at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(
    at htsjdk.tribble.AsciiFeatureCodec.decode(
    at htsjdk.tribble.AsciiFeatureCodec.decode(
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(
    at htsjdk.tribble.TribbleIndexedFeatureReader$
    at htsjdk.tribble.TribbleIndexedFeatureReader$
    at picard.vcf.LiftoverVcf.doWork(
    at picard.cmdline.CommandLineProgram.instanceMain(
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(
    at org.broadinstitute.hellbender.Main.mainEntry(
    at org.broadinstitute.hellbender.Main.main(

Any suggestions on how to convert the 'I' and 'D' annotations into a format compatible with VCF 4.2 would be greatly appreciated. I've been struggling with this problem for a few days now.

I don't think this can work. From the GATK documentation:

For each variant, the tool will look for the target coordinate, reverse-complement and left-align the variant if needed, and, in the case that the reference and alternate alleles of a SNP have been swapped in the new genome build, it will adjust the SNP, and correct AF-like INFO fields and the relevant genotypes.

So, as far as I understand, the alleles must be actual bases (A, T, G, C). Unless you find a way to restore the real REF and ALT sequences, you'd be better off re-calling the variants from the BAM with modern tools.
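
If restoring the sequences is not possible, one fallback is to set the 'I'/'D' records aside and lift over only the records with real bases. Below is a minimal sketch of that idea, not a tested pipeline: the function name split_vcf_by_alleles and the output file names are made up for illustration, and it assumes an uncompressed, tab-delimited VCF with REF in column 4 and ALT in column 5.

```shell
# Hypothetical helper: split a VCF into records whose REF/ALT contain only
# A/C/G/T/N and records with anything else (such as 'I' or 'D').
# Usage: split_vcf_by_alleles input.vcf clean.vcf flagged.vcf
split_vcf_by_alleles() {
    awk -F'\t' -v clean="$2" -v flagged="$3" '
        # Copy header lines to both outputs so each remains a valid VCF.
        /^#/ { print > clean; print > flagged; next }
        {
            # REF must be plain bases; ALT may be a comma-separated list of them.
            # Anything else (I, D, ".", "*", symbolic alleles) is flagged.
            if ($4 ~ /^[ACGTNacgtn]+$/ && $5 ~ /^[ACGTNacgtn]+(,[ACGTNacgtn]+)*$/)
                print > clean
            else
                print > flagged
        }
    ' "$1"
}
```

For example, split_vcf_by_alleles SNP_GRCh37.vcf clean.vcf flagged.vcf would write the records with plain bases to clean.vcf, which could then be given to the liftover, while flagged.vcf collects the 'I'/'D' records for inspection. Note that this check is deliberately strict, so records with "." or symbolic ALT alleles are flagged as well.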

Pierre Lindenbaum is correct. Also, note that the GATK option --RECOVER_SWAPPED_REF_ALT True does not work with indels. In general, if your VCF includes indels, avoid tools such as GATK/LiftoverVcf or CrossMap/VCF, as explained here

