I have downloaded variation data from the popfly database (http://popfly.uab.cat). However, the vcf headers are all over the place. If I try to index the vcf files I get segmentation fault: 11, and if I try to parse them using bcftools I get various errors in the header. I think the easiest solution would be to entirely rewrite the headers. Is there a tool that does this?
As an aside, is there any benefit to vcf files over bed files? I know many annotation tools take vcf as input (e.g. snpEff), but I can strip away the header and convert to big bed and still annotate with annovar for instance.
It seems that the contig header lines are incorrect.
If you have access to the genome fasta index that was used for the analysis, the following script should remove the existing '##contig' lines and replace them with the correct details from the fasta index.
# Input / output filenames
# Extract assembly name
# Ensure the file extension passed to 'basename' matches the suffix of the index file
assembly_name=$(basename $index_file .fna.fai)
# Build desired '##contig' lines from fasta index file
cut -f 1-2 "$index_file" |\
sed -r 's/^(chr)/##contig=<ID=\1/' |\
sed -r 's/\s([0-9]+)/,length=\1/' |\
sed -r "s/$/,assembly=$assembly_name>/" > "$assembly_name.vcfHeaderContigs.txt"
# Locate line numbers of VCF header containing '##contig' entries
start_line=$(grep -n -m 1 "##contig" "$vcf_bad_header" | cut -d ":" -f 1)
end_line=$(grep -n "##contig" "$vcf_bad_header" | tail -n 1 | cut -d ":" -f 1)
# Build VCF: (1) before bad contigs (2) new contig lines (3) continue after bad contigs
head -n $(($start_line-1)) "$vcf_bad_header" > "$output_vcf"
cat "$assembly_name.vcfHeaderContigs.txt" >> "$output_vcf"
tail -n +$(($end_line+1)) "$vcf_bad_header" >> "$output_vcf"
Can you post an example of the header that is giving you problems? I work with vcf daily, haven't had any problems!
Hi, I have added an example header. Thanks.
header seems fine, what is the problem you get besides segmentation fault? Do you get more info?
When I try:
Yet when I zip the file with bgzip and then try to tabix index it I get segmentation fault 11.
It seems that your file doesn't have all the contigs in the header. How did you generate such vcf file? can you post the generation command?
I didn't actually generate the file. It is from the pop fly drosophila database (http://popfly.uab.cat/). I confess that I am a primarily theoretical biologist and so do not generate my own data, so I am not particularly familiar with the vcf generation process.
Ok, we might be closer to the solution. How did you get the file? can you download it from there?
Quick workaround for the time being:
you can modify the header lines where the contigs are specified, adding one for each chromosome (2L, 2R, 3L, 3R, X) where you specify the length. Like this:
Thank you. Yes you can download the file from the link I sent above. Under resources you can download data. It is the vcf file for the AM population (I cannot link directly for some reason).
Have you checked the line endings and file encoding? Perhaps that is what is tripping up the software.
If you want to post the VCF in question I can take a closer look at it.
Thanks. I have edited to include both the header and the first few lines.