VCF deletions incorrectly formatted
4.8 years ago

Hi all,

I'm working with a VCF (v4.1) in which the deletions are incorrectly formatted for some reason. The insertions are fine, but the deletions are written like this (example):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
2   32474671    indel.60227 A   -   .   PASS    .   GT

Notice that the ALT is -, whereas the line should have been formatted like this (example):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
2   32474670    indel.60227 GA  G   .   PASS    .   GT

I have no idea how the deletions ended up like this in the VCF, but my present plan is to look up these positions in a reference genome FASTA file and correct all the deletion records myself, so I don't have to drop them from the VCF. What I wanted to know is whether there's a tool that already does this. As it stands, I'm writing my own parser.

vcf QC reference panel • 1.9k views

It is quite odd that insertions are fine and deletions are not. Older VCF versions (4.1) had . for REF in insertions and . for ALT in deletions, so either both should be affected or neither should be.

Maybe give this tool a shot? Disclaimer: this tool is not mine and I have never used it. bcftools norm --check-ref might also be able to fix the REF alleles, though I'm not sure.


I definitely agree that it's odd. I'm having trouble finding any older version that used - as ALT in deletions, so I'm not sure that was ever a convention. A big part of this problem is that I can't figure out how the people who supplied the VCF ended up in this situation.

bcftools doesn't seem to fix the problem, probably because the REF alleles are fine; it's the ALT alleles that are botched.

Looking into the other tool that you linked. Hopefully it helps.


I don't know of any tool that uses - for this; older versions used ., not -.

4.8 years ago

I wound up just parsing the VCF in Python and calling samtools to correct the deletions. For each deletion, subtract 1 from the listed POS, call samtools faidx <reference_genome> chr:del_start-del_end over that range to get the padded REF (the preceding base plus the deleted bases), and use the first character of that sequence as ALT. I don't believe there's a tool that corrects this problem in VCFs, because I don't think this is a common (or normal) problem. Leaving this here in case anyone ever encounters the same situation.
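
For anyone who wants a starting point, here is a rough sketch of that approach. It is not the exact script I used: the function names fetch_region and fix_deletion are made up for illustration, and it assumes tab-separated records, an indexed reference FASTA, and samtools on the PATH. It also assumes the listed POS is the first deleted base, as in the example above.

```python
import subprocess


def fetch_region(fasta, chrom, start, end):
    """Return reference bases for chrom:start-end (1-based, inclusive).

    Shells out to `samtools faidx`; assumes samtools is installed and the
    FASTA is indexable. `fasta` is whatever reference path you have.
    """
    out = subprocess.run(
        ["samtools", "faidx", fasta, f"{chrom}:{start}-{end}"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Drop the FASTA header line and join any wrapped sequence lines.
    return "".join(out.splitlines()[1:]).upper()


def fix_deletion(line, fetch):
    """Rewrite one VCF line whose ALT is '-' into padded REF/ALT form.

    `fetch(chrom, start, end)` must return the reference sequence for that
    range. Header lines and records with any other ALT are returned as-is.
    """
    line = line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) < 5 or fields[4] != "-":
        return line
    chrom, pos, deleted = fields[0], int(fields[1]), fields[3]
    if pos < 2:
        # No preceding base to pad with; not handled in this sketch.
        raise ValueError(f"cannot left-pad deletion at {chrom}:{pos}")
    start = pos - 1                 # padding base just before the deletion
    end = pos + len(deleted) - 1    # last deleted base
    ref = fetch(chrom, start, end)
    # Sanity check: the reference must agree with the deleted bases listed.
    if ref[1:] != deleted.upper():
        raise ValueError(f"REF mismatch at {chrom}:{pos}: {deleted} vs {ref[1:]}")
    fields[1] = str(start)
    fields[3] = ref
    fields[4] = ref[0]
    return "\t".join(fields)
```

To run it over a file, call fix_deletion on each line with something like lambda c, s, e: fetch_region("reference.fa", c, s, e) as the fetch argument, where reference.fa stands in for your own reference. Passing the lookup in as a function also makes it easy to swap in a different FASTA reader if spawning samtools per record is too slow.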
