Question

Merging VCFs with bcftools: problems with the INFO column when running tapes on the merged VCF

14 months ago

brunomiwa • 0

Hello,

I'm merging vcf files into one with bcftools merge. Here's an example of an individual VCF:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  01GO_S2
chr1    14604   N/A A   G   .   PASS    .   GT:GQ:DP:SR:VR:VA:SB:ABQ:AMQ    0/1:71.03:35:14.29:14.29:0:5.71:37.00:48.20
chr1    14610   N/A T   C   .   PASS    .   GT:GQ:DP:SR:VR:VA:SB:ABQ:AMQ    0/1:88.73:43:13.95:13.95:0:4.65:37.00:50.67

COMMAND:

bcftools merge -0 --missing-to-ref a.vcf.gz b.vcf.gz c.vcf.gz -o d.vcf

The merged vcf looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  01GO_S2 01MAR_S1    01PB_S4
chr1    14574   N/A A   G   .   PASS    .   GT:GQ:DP:SR:VR:VA:SB:ABQ:AMQ    0/0:.:.:.:.:.:.:.:. 0/1:80.46:14:28.57:28.57:0:0:34:33  0/0:.:.:.:.:.:.:.:.
chr1    14590   N/A G   A   .   PASS    .   GT:GQ:DP:SR:VR:VA:SB:ABQ:AMQ    0/0:.:.:.:.:.:.:.:. 0/1:101.13:25:20:20:0:8:37:38.6 0/1:62.74:10:30:30:0:0:37:57

When trying to use tapes, I receive the following error:

File "/home/bruno/.local/lib/python3.10/site-packages/vcf_parser/utils/format_variant.py", line 73, in format_variant
    raise SyntaxError("The INFO field {0} is not specified in vcf"\
SyntaxError: The INFO field . is not specified in vcf header. chr1  14574   N/A A   G   .   PASS    .   GT:GQ:DP:SR:VR:VA:SB:ABQ:AMQ    0/0:.:.:.:.:.:.:.:. 0/1:80.46:14:28.57:28.57:0:0:34:33  0/0:.:.:.:.:.:.:.:.

It seems that tapes (and other scripts that use the Python vcf_parser library) treats the '.' in the INFO column as an actual field rather than as a missing value. However, the individual VCFs had nothing in the INFO field either. I tried manipulating the header, to no avail. It seems like a simple issue (my last resort is to add a fake INFO field to every row and declare it in the header).
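The "last resort" mentioned above could look roughly like the following. This is only a minimal sketch, not something tested against tapes or vcf_parser: the tag name `DUMMY` and the function name `add_placeholder_info` are made up for illustration, and it assumes an uncompressed, tab-separated VCF read line by line.

```python
# Hypothetical tag name; any ID not already used in the header would do.
PLACEHOLDER_ID = "DUMMY"
HEADER_LINE = (
    f"##INFO=<ID={PLACEHOLDER_ID},Number=0,Type=Flag,"
    'Description="Placeholder for records with an empty INFO column">'
)


def add_placeholder_info(lines):
    """Yield VCF lines where a '.' INFO value is replaced by a declared flag."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
        elif line.startswith("#CHROM"):
            # Declare the placeholder just before the column header line.
            yield HEADER_LINE
            yield line
        elif line:
            fields = line.split("\t")
            if fields[7] == ".":  # INFO is the 8th column
                fields[7] = PLACEHOLDER_ID
            yield "\t".join(fields)
```

It would be used by opening the merged file (for example `d.vcf`), passing the file handle to `add_placeholder_info`, and writing each yielded line to a new file. Whether the parser then accepts the records is an assumption that would need checking.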

Can someone help? Thanks in advance

bcftools • 550 views

updated 10 months ago by Ram 43k • written 14 months ago by brunomiwa • 0

Welcome! Can you please paste the lines of the VCF into the question as text and format them as code, rather than as pictures? Thank you.

14 months ago by 4galaxy77 2.8k

For anyone who happens to land here: I managed to 'solve' the problem.

I have two types of data from the same samples, FASTA and VCF, both provided by the company, so we don't know which procedures were used to produce the VCFs, such as adapter trimming or which databases the SNPs were called against. In any case, their VCFs weren't parsimony-normalized, so genotype.py, which parses some of the arguments and information, got scrambled. My guess is that genotypes other than 0/0, 1/0, 0/1, 1/1 disrupted the data flow.

So normalizing with bcftools solved it.
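The exact normalization options are not stated above, so the following is only a guess at what such a step might look like. It assumes that splitting multiallelic sites with `bcftools norm -m -any` is what removed the genotypes other than 0/0, 0/1, 1/0 and 1/1, and that a reference FASTA is optionally supplied via `-f` for left-alignment. The helper names and the file names are hypothetical.

```python
import subprocess


def build_norm_command(in_vcf, out_vcf, reference_fasta=None):
    """Build a bcftools norm command that splits multiallelic records."""
    cmd = ["bcftools", "norm", "-m", "-any"]
    if reference_fasta is not None:
        # With a reference, bcftools norm can also left-align indels.
        cmd += ["-f", reference_fasta]
    # Write bgzip-compressed VCF to out_vcf.
    cmd += ["-O", "z", "-o", out_vcf, in_vcf]
    return cmd


def normalize(in_vcf, out_vcf, reference_fasta=None):
    """Run the command; requires bcftools to be available on PATH."""
    subprocess.run(build_norm_command(in_vcf, out_vcf, reference_fasta), check=True)
```

For example, `normalize("d.vcf.gz", "d.norm.vcf.gz")` would run the equivalent of `bcftools norm -m -any -O z -o d.norm.vcf.gz d.vcf.gz`.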

14 months ago by brunomiwa • 0