Question: Add contig length to VCF header in a robust way
2.6 years ago by
William4.4k wrote:

I have some older VCF files that don't have the contig length set in the VCF header. This means that Picard and some other tools that are very strict with the VCF spec won't accept them.

The contig entries in the header should be


but they are


I know that I can manually fix this by doing the following steps.

  • unzipping the file
  • extracting the header
  • looking up the contig lengths in a fasta.fa.fai file
  • adding the length to the contig records in the header with vim
  • re-headering with bcftools
  • bgzipping and tabix-indexing the re-headered VCF file
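
The vim step in the list above is the error-prone one, and it can be scripted. What follows is only a minimal sketch, assuming a tab-separated .fai index whose first two columns are contig name and length. The function name add_contig_lengths and all file names are hypothetical. It reads header text on stdin, appends length=... to every ##contig line that lacks it, and stops with an error if a contig is missing from the index.

```shell
# Add length=... to ##contig header lines that lack it, using a .fai index.
# Usage: add_contig_lengths ref.fa.fai < header.txt > new_header.txt
add_contig_lengths() {
    awk -v fai="$1" '
        BEGIN {
            # Load contig name -> length from the first two .fai columns.
            while ((getline line < fai) > 0) {
                split(line, f, "\t")
                len[f[1]] = f[2]
            }
        }
        /^##contig=<ID=/ && !/length=/ {
            # Extract the contig ID: text after "ID=" up to "," or ">".
            id = $0
            sub(/^##contig=<ID=/, "", id)
            sub(/[,>].*$/, "", id)
            if (!(id in len)) {
                print "contig " id " not found in " fai > "/dev/stderr"
                exit 1
            }
            # Append the length just before the closing ">".
            sub(/>[ \t\r]*$/, ",length=" len[id] ">")
        }
        { print }
    '
}

# Possible use with the steps above (placeholder file names):
#   bcftools view -h in.vcf.gz | add_contig_lengths ref.fa.fai > new_header.txt
#   bcftools reheader -h new_header.txt -o out.vcf.gz in.vcf.gz
#   tabix -p vcf out.vcf.gz
```

Because bcftools reheader works on the bgzipped file directly, this variant should not need the unzip and re-bgzip steps. Other attributes already present on a contig line are left in place.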

In my hands this works, but if you make a slight copy-paste error in vim you will have spent a lot of time re-headering and bgzipping a large VCF file for nothing.

Therefore I would like to have a more robust automatic solution where I just give the VCF file and the reference genome file and the header is automatically fixed and a new bgzipped VCF file is written out.

Is there a tool that can add the contig lengths to the VCF header and write out a new bgzipped VCF file?

vcf • 2.3k views
2.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

Use awk to insert the contig lines before the #CHROM line. Something like:

awk '/^#CHROM/ { printf("##contig=<ID=1,length=195471971>\n##contig=<ID=2,length=182113224>\n");} {print;}' in.vcf > out.vcf

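You can generate the printf line above from the ref.fai file (the escaped \\n prints a literal \n rather than a newline, so the output is a single line that can be pasted into the printf string):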

  awk '{printf("##contig=<ID=%s,length=%d>\\n",$1,$2);}' ref.fai
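
The two commands above can also be folded into one pass, so nothing has to be pasted by hand. This is a rough sketch rather than a tested pipeline, and the function name insert_contigs_from_fai and the file names are hypothetical. It reads the .fai inside awk and prints one ##contig line per record just before #CHROM. It also drops any existing ##contig lines, which goes beyond the commands above and is an assumption made here so that no contig is listed twice.

```shell
# Insert one ##contig line per .fai record just before the #CHROM line.
# Usage: insert_contigs_from_fai ref.fai in.vcf > out.vcf
insert_contigs_from_fai() {
    awk -v fai="$1" '
        # Assumption: discard existing contig lines to avoid duplicates.
        /^##contig=/ { next }
        /^#CHROM/ {
            # Emit contig name and length from the first two .fai columns.
            while ((getline line < fai) > 0) {
                split(line, f, "\t")
                printf("##contig=<ID=%s,length=%s>\n", f[1], f[2])
            }
        }
        { print }
    ' "$2"
}

# Possible use (placeholder file names):
#   insert_contigs_from_fai ref.fai in.vcf | bgzip > out.vcf.gz
#   tabix -p vcf out.vcf.gz
```

Dropping the old contig lines also discards any extra attributes they carried, such as assembly, so check the resulting header before overwriting anything.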

Great solution, particularly as there's an opportunity to double-check the lengths before writing them to the VCF :)

written 2.6 years ago by John
2.6 years ago by
William4.4k wrote:

Running SelectVariants (selecting all variants) with the GATK framework also adds the contig length to the contig records in the header. The upside is that it is a one-liner with an off-the-shelf tool:

gatk-framework -T SelectVariants -V input.vcf.gz -R /ref.fa -o input_header_fixed.vcf.gz

The downside is that the full VCF is inflated to the GATK Variant/GenotypeContext objects and rewritten to a new VCF. So it is slower than an awk or C/C++ tool solution, and the new VCF is slightly different (different order or precision of Variant/Genotype attributes).
