Question: Writing vcf file headers
0
gravatar for spiral01
2.7 years ago by
spiral01100
spiral01100 wrote:

I have downloaded variation data from the popfly database (http://popfly.uab.cat). However, the vcf headers are all over the place. If I try to index the vcf files I get segmentation fault: 11, and if I try to parse them using bcftools I get various errors in the header. I think the easiest solution would be to entirely rewrite the headers. Is there a tool that does this?

As an aside, is there any benefit to vcf files over bed files? I know many annotation tools take vcf as input (e.g. snpEff), but I can strip away the header and convert to big bed and still annotate with annovar for instance.

Here is the header and the first few lines:

##fileformat=VCFv4.1
##contig=<ID=1,length=23011544>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RAL-149_Chr2L   RAL-335_Chr2L   RAL-357_Chr2L   RAL-399_Chr2L   RAL-42_Chr2L    RAL-461_Chr2L   RAL-491_Chr2L   RAL-703_Chr2L   RAL-721_Chr2L   RAL-783_Chr2L   USI02_Chr2L USI03_Chr2L USI16_Chr2L USI17_Chr2L USI24_Chr2L USI31_Chr2L USI33_Chr2L USI34_Chr2L USI35_Chr2L USI38_Chr2L USW_35_Chr2L    USW_37_Chr2L    USW_40_Chr2L    USW_49_Chr2L    USW_50_Chr2L    USW_54_Chr2L    USW_59_Chr2L    USW_66_Chr2L    USW_69_Chr2L    USW_74_Chr2L
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5095    .   T   N,A .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5317    .   G   N,A .   .   .   GT  0   0   0   0   0   0   1   0   1   0
2L  5372    .   T   N,A .   .   .   GT  0   1   0   0   2   0   1   0   1   0
snp vcf • 3.8k views
ADD COMMENTlink modified 5 weeks ago by johnston.mike.j10 • written 2.7 years ago by spiral01100
1

Can you post an example of the header that is giving you problems? I work with vcf daily, haven't had any problems!

ADD REPLYlink written 2.7 years ago by Macspider3.2k

Hi, I have added an example header. Thanks.

ADD REPLYlink written 2.7 years ago by spiral01100
1

header seems fine, what is the problem you get besides segmentation fault? Do you get more info?

ADD REPLYlink written 2.7 years ago by Macspider3.2k

When I try:

bcftools view -G file.vcf

I get:

[W::vcf_parse] contig '2L' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

Yet when I zip the file with bgzip and then try to tabix index it I get segmentation fault 11.

ADD REPLYlink written 2.7 years ago by spiral01100

It seems that your file doesn't have all the contigs in the header. How did you generate such vcf file? can you post the generation command?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Macspider3.2k

I didn't actually generate the file. It is from the pop fly drosophila database (http://popfly.uab.cat/). I confess that I am a primarily theoretical biologist and so do not generate my own data, so I am not particularly familiar with the vcf generation process.

ADD REPLYlink written 2.7 years ago by spiral01100
1

Ok, we might be closer to the solution. How did you get the file? can you download it from there?

Quick workaround for the time being:

you can modify the header lines where the contigs are specified, adding one for each chromosome (2L, 2R, 3L, 3R, X) where you specify the length. Like this:

##fileformat=VCFv4.1
##contig=<ID=2L,length=insert_length>
##contig=<ID=2R,length=insert_length>
##contig=<ID=3L,length=insert_length>
##contig=<ID=3R,length=insert_length>
##contig=<ID=X,length=insert_length>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ...
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Macspider3.2k

Thank you. Yes you can download the file from the link I sent above. Under resources you can download data. It is the vcf file for the AM population (I cannot link directly for some reason).

ADD REPLYlink written 2.7 years ago by spiral01100

Have you checked the line endings and file encoding? Perhaps that is what is tripping up the software.

If you want to post the VCF in question I can take a closer look at it.

ADD REPLYlink written 2.7 years ago by Dan D7.1k

Thanks. I have edited to include both the header and the first few lines.

ADD REPLYlink written 2.7 years ago by spiral01100
0
gravatar for johnston.mike.j
5 weeks ago by
johnston.mike.j10 wrote:

It seems that the contig header lines are incorrect. If you have access to the genome fasta index that was used for the analysis, the following script should remove the existing '##contig' lines and replace them with the correct details from the fasta index.

# Input / output filenames
index_file="/home/user/reference/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.fai"
vcf_bad_header="genotypes.incorrectHeaderContigs.vcf"
output_vcf="genotypes.vcf"

# Extract assembly name
# Ensure the file extension passed to 'basename' matches the suffix of the index file
assembly_name=$(basename $index_file .fna.fai)

# Build desired '##contig' lines from fasta index file
cut -f 1-2 "$index_file" |\
sed -r 's/^(chr)/##contig=<ID=\1/' |\
sed -r 's/\s([0-9]+)/,length=\1/' |\
sed -r "s/$/,assembly=$assembly_name>/" > "$assembly_name.vcfHeaderContigs.txt"

# Locate line numbers of VCF header containing '##contig' entries
start_line=$(grep -n -m 1 "##contig" "$vcf_bad_header" | cut -d ":" -f 1)
end_line=$(grep -n "##contig" "$vcf_bad_header" | tail -n 1 | cut -d ":" -f 1)

# Build VCF: (1) before bad contigs (2) new contig lines (3) continue after bad contigs
head -n $(($start_line-1)) "$vcf_bad_header" > "$output_vcf"
cat "$assembly_name.vcfHeaderContigs.txt" >> "$output_vcf"
tail -n +$(($end_line+1)) "$vcf_bad_header" >> "$output_vcf"
ADD COMMENTlink written 5 weeks ago by johnston.mike.j10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1285 users visited in the last hour