Error: malformed header while uploading .vcf.gz files to Michigan imputation server
19 months ago
manprees • 0

I have been stuck on a problem for two days while uploading data to the Michigan Imputation Server. Any help is much appreciated!

I received .bgen files from 23andMe; a single .bgen file contains all the participants and their genotype data. Following the Michigan Imputation Server guidelines, I converted the .bgen file to a .vcf file with qctool, using the command:

$ qctool -g example.bgen -og example.vcf

Then I ran the following steps (using bgzip and tabix) so that the data could be uploaded to the server:

# compress vcf to gz
bgzip -c ${1}.vcf > ${1}.vcf.gz

# make tabix index
tabix -p vcf ${1}.vcf.gz

# split into 22 separate chromosomes and create a gz file for each one
for chr in $(seq 1 22); do
    out=$(printf '%s.chr%02d.vcf' "${1}" "${chr}")
    tabix -h "${1}.vcf.gz" "${chr}" > "${out}"
    bgzip -c "${out}" > "${out}.gz"
done

Then I uploaded the per-chromosome .vcf.gz files to the server and got a malformed header error:

Unable to parse header with error: Your input file has a malformed header: Unexpected tag Type in line , for input source: /data3/imputation-server/workspace/job-20190822-200703-201/input/files/64ba01fa-b382-4b48-80b7-fdced5a84e11.vcf (see Help).

I understand that the header is malformed. Is it due to the absence of a .sample file (which contains header information) when I converted .bgen to .vcf with qctool, or is it something else?

I would really appreciate it if you could suggest a way to fix this!

SNP software error • 1.0k views

Please show us the header and some example variants from the vcf file. Otherwise we can only guess.
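A minimal sketch of one way to pull out the header and the first few records for posting, assuming a plain-text VCF and standard `grep` and `head`. The helper name `show_vcf_preview` and the file `demo.vcf` are made up for illustration, and the tiny VCF is dummy data, not the poster's file.

```shell
# Hypothetical helper: print every header line plus the first N records of a VCF.
show_vcf_preview() {
    vcf="$1"
    n="${2:-5}"
    grep '^#' "$vcf"                     # meta-information lines and the #CHROM line
    grep -v '^#' "$vcf" | head -n "$n"   # first N variant records
}

# Dummy two-record VCF, for illustration only.
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n1\t200\trs2\tC\tT\t.\t.\t.\tGT\t0/0\n' > "$tmpdir/demo.vcf"

show_vcf_preview "$tmpdir/demo.vcf" 1
```

For a bgzipped file, the same idea should work after decompressing to a stream first, for example with `gzip -dc file.vcf.gz`.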


fin swimmer


I created a link where you can see the .vcf file opened in bash and notepad for your reference: .vcf file in bash and notepad

Any help is much appreciated. Thanks!


Using a vcf-validator may help you pinpoint exactly what could be causing the error.


I used checkVCF to pinpoint the source of the error. It reported the following:

Line [ %d ] does not have GT defined in the FORMAT field 
Duplicated site [ 1:2526746 ]
Line [ 1845 ] does not have correct column number, exiting!

Does this mean that the .bgen file I converted to .vcf was not in the right format? Do you know of any way to fix it?


I have no idea what a .bgen file is, nor what its format is like. However, there are only 3 errors (each of which is explained quite plainly). You can correct them manually easily enough. One line is duplicated, one has the improper number of columns based on the headers, and one is missing GT in the format field.
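The column-count and missing-GT problems can be located with a short `awk` pass. This is only a sketch under assumptions: the helper name `check_vcf_records` and the dummy file are invented for illustration, and it only checks that each record has as many tab-separated columns as the `#CHROM` line and that the FORMAT column (column 9) lists a `GT` key.

```shell
# Hypothetical helper: report records with a wrong column count or no GT in FORMAT.
check_vcf_records() {
    awk -F '\t' '
        /^##/      { next }                 # skip meta-information lines
        /^#CHROM/  { ncol = NF; next }      # remember the expected column count
        NF != ncol { printf "line %d: %d columns, expected %d\n", NR, NF, ncol; next }
        {
            has_gt = 0
            n = split($9, keys, ":")        # FORMAT keys are colon-separated
            for (i = 1; i <= n; i++) if (keys[i] == "GT") has_gt = 1
            if (!has_gt) printf "line %d: FORMAT \"%s\" has no GT\n", NR, $9
        }
    ' "$1"
}

# Dummy VCF: line 3 is fine, line 4 has GP but no GT, line 5 lacks the sample column.
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n1\t200\trs2\tC\tT\t.\t.\t.\tGP\t0,1,0\n1\t300\trs3\tG\tA\t.\t.\t.\tGT\n' > "$tmpdir/check.vcf"

check_vcf_records "$tmpdir/check.vcf"
# line 4: FORMAT "GP" has no GT
# line 5: 9 columns, expected 10
```

The reported numbers are line numbers in the file, header included, so they can be passed straight to `sed -n '5p' file.vcf` to look at the offending record.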


I am new to this! It would be really helpful if you could help me out with the errors!


It is best to learn by doing. We don't have the ability to scroll through your file. Based on what you've found, you know there's likely an issue with the header and maybe with certain records. Look at the VCF specs and ensure your file meets them (particularly the metadata/header sections).


I was able to rectify the duplicated site error using:

( grep  '^#' input.vcf ; grep -v "^#" input.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n -k4,4 | awk -F '\t' 'BEGIN{ prev="";} {key=sprintf("%s\t%s\t%s",$1,$2,$4);if(key==prev) next;print;prev=key;}' )  > out.vcf
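To make the effect of that one-liner easier to see, here is the same idea wrapped in a function and run on a dummy file. This is a sketch only: the name `dedup_vcf_sites` and the sample records are invented for illustration. It keeps the header lines, sorts records by CHROM, POS and REF, and keeps the first record for each (CHROM, POS, REF) key.

```shell
# Hypothetical helper: keep one record per (CHROM, POS, REF), header lines first.
dedup_vcf_sites() {
    tab=$(printf '\t')
    grep '^#' "$1"
    grep -v '^#' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k4,4 |
        awk -F '\t' '{ key = $1 FS $2 FS $4; if (key == prev) next; print; prev = key }'
}

# Dummy VCF with the site 1:100 listed twice, for illustration only.
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n1\t200\trs2\tC\tT\t.\t.\t.\tGT\t0/0\n1\t100\trs1\tA\tG\t.\t.\t.\tGT\t0/1\n1\t100\trs1dup\tA\tG\t.\t.\t.\tGT\t0/1\n' > "$tmpdir/dup.vcf"

dedup_vcf_sites "$tmpdir/dup.vcf"
```

Note that with `LC_ALL=C` the first key is compared as text, so chromosome 10 sorts before chromosome 2; whether that ordering is acceptable depends on the downstream tool.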

I figured out that the earlier commands had messed up my VCF header.

But I am still not able to resolve this error:

Line [ %d ] does not have GT defined in the FORMAT field

I have clearly defined the FORMAT field, and it does not include GT. I am attaching a snip of the file for your reference: vcf file snip

Line 1841 does not have correct column number, exiting! I am highlighting the line in the snip for your reference. A roadmap to solve these errors will be much appreciated.

vcf snip with line 1841

