Question: Writing vcf file headers
0
gravatar for spiral01
18 months ago by
spiral0180
spiral0180 wrote:

I have downloaded variation data from the popfly database (http://popfly.uab.cat). However, the vcf headers are all over the place. If I try to index the vcf files I get segmentation fault: 11, and if I try to parse them using bcftools I get various errors in the header. I think the easiest solution would be to entirely rewrite the headers. Is there a tool that does this?

As an aside, is there any benefit to vcf files over bed files? I know many annotation tools take vcf as input (e.g. snpEff), but I can strip away the header and convert to big bed and still annotate with annovar for instance.

Here is the header and the first few lines:

##fileformat=VCFv4.1
##contig=<ID=1,length=23011544>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RAL-149_Chr2L   RAL-335_Chr2L   RAL-357_Chr2L   RAL-399_Chr2L   RAL-42_Chr2L    RAL-461_Chr2L   RAL-491_Chr2L   RAL-703_Chr2L   RAL-721_Chr2L   RAL-783_Chr2L   USI02_Chr2L USI03_Chr2L USI16_Chr2L USI17_Chr2L USI24_Chr2L USI31_Chr2L USI33_Chr2L USI34_Chr2L USI35_Chr2L USI38_Chr2L USW_35_Chr2L    USW_37_Chr2L    USW_40_Chr2L    USW_49_Chr2L    USW_50_Chr2L    USW_54_Chr2L    USW_59_Chr2L    USW_66_Chr2L    USW_69_Chr2L    USW_74_Chr2L
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5095    .   T   N,A .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5317    .   G   N,A .   .   .   GT  0   0   0   0   0   0   1   0   1   0
2L  5372    .   T   N,A .   .   .   GT  0   1   0   0   2   0   1   0   1   0
snp vcf • 1.6k views
ADD COMMENTlink modified 18 months ago • written 18 months ago by spiral0180
1

Can you post an example of the header that is giving you problems? I work with vcf daily, haven't had any problems!

ADD REPLYlink written 18 months ago by Macspider2.9k

Hi, I have added an example header. Thanks.

ADD REPLYlink written 18 months ago by spiral0180
1

header seems fine, what is the problem you get besides segmentation fault? Do you get more info?

ADD REPLYlink written 18 months ago by Macspider2.9k

When I try:

bcftools view -G file.vcf

I get:

[W::vcf_parse] contig '2L' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

Yet when I zip the file with bgzip and then try to tabix index it I get segmentation fault 11.

ADD REPLYlink written 18 months ago by spiral0180

It seems that your file doesn't have all the contigs in the header. How did you generate such vcf file? can you post the generation command?

ADD REPLYlink modified 18 months ago • written 18 months ago by Macspider2.9k

I didn't actually generate the file. It is from the pop fly drosophila database (http://popfly.uab.cat/). I confess that I am a primarily theoretical biologist and so do not generate my own data, so I am not particularly familiar with the vcf generation process.

ADD REPLYlink written 18 months ago by spiral0180
1

Ok, we might be closer to the solution. How did you get the file? can you download it from there?

Quick workaround for the time being:

you can modify the header lines where the contigs are specified, adding one for each chromosome (2L, 2R, 3L, 3R, X) where you specify the length. Like this:

##fileformat=VCFv4.1
##contig=<ID=2L,length=insert_length>
##contig=<ID=2R,length=insert_length>
##contig=<ID=3L,length=insert_length>
##contig=<ID=3R,length=insert_length>
##contig=<ID=X,length=insert_length>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ...
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1
ADD REPLYlink modified 18 months ago • written 18 months ago by Macspider2.9k

Thank you. Yes you can download the file from the link I sent above. Under resources you can download data. It is the vcf file for the AM population (I cannot link directly for some reason).

ADD REPLYlink written 18 months ago by spiral0180

Have you checked the line endings and file encoding? Perhaps that is what is tripping up the software.

If you want to post the VCF in question I can take a closer look at it.

ADD REPLYlink written 18 months ago by Dan D6.8k

Thanks. I have edited to include both the header and the first few lines.

ADD REPLYlink written 18 months ago by spiral0180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1785 users visited in the last hour