Below solution uses the pyvcf
submodule from the fuc
package I wrote:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103],
... 'ID': ['.', '.', '.', '.'],
... 'REF': ['G', 'T', 'A', 'C'],
... 'ALT': ['A', 'C', 'T', 'A'],
... 'QUAL': ['.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT'],
... 'GroupA_1': ['1/0', '0/0', '0/0', '0/1'],
... 'GroupA_2': ['1/1', '0/1', '0/0', '0/0'],
... 'GroupB_1': ['0/0', '0/0', '1/1', '0/0'],
... 'GroupB_2': ['0/0', '0/1', '1/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf') # in your case
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GroupA_1 GroupA_2 GroupB_1 GroupB_2
0 chr1 100 . G A . . . GT 1/0 1/1 0/0 0/0
1 chr1 101 . T C . . . GT 0/0 0/1 0/0 0/1
2 chr1 102 . A T . . . GT 0/0 0/0 1/1 1/1
3 chr1 103 . C A . . . GT 0/1 0/0 0/0 0/0
>>> vf_a = vf.subset(['GroupA_1', 'GroupA_2'])
>>> vf_b = vf.subset(['GroupB_1', 'GroupB_2'])
>>> vf_a = vf_a.compute_info('AF')
>>> vf_b = vf_b.compute_info('AF')
>>> vf_a.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GroupA_1 GroupA_2
0 chr1 100 . G A . . AF=0.750 GT 1/0 1/1
1 chr1 101 . T C . . AF=0.250 GT 0/0 0/1
2 chr1 102 . A T . . AF=0.000 GT 0/0 0/0
3 chr1 103 . C A . . AF=0.250 GT 0/1 0/0
>>> vf_b.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GroupB_1 GroupB_2
0 chr1 100 . G A . . AF=0.000 GT 0/0 0/0
1 chr1 101 . T C . . AF=0.250 GT 0/0 0/1
2 chr1 102 . A T . . AF=1.000 GT 1/1 1/1
3 chr1 103 . C A . . AF=0.000 GT 0/0 0/0
>>> vf_a.extract_info('#AF')
0 0.75
1 0.25
2 0.00
3 0.25
dtype: float64
>>> vf_b.extract_info('#AF')
0 0.00
1 0.25
2 1.00
3 0.00
dtype: float64
>>> vf_a.extract_info('#AF') - vf_b.extract_info('#AF')
0 0.75
1 0.00
2 -1.00
3 0.25
dtype: float64
Important disclaimer: the pyvcf.VcfFrame.compute_info
method was recently introduced to the 0.31.0-dev
branch; therefore, to use this method, you need to install the development version of fuc
:
$ conda create -n fuc -c bioconda fuc
$ conda activate fuc
$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ git checkout 0.31.0-dev
$ pip install .
$ fuc -v
Let me know in the comment section if you have any questions.
I would actually just use the pysam package instead of writing your own parser. The VCF format is very complicated and it is not recommended to parse the information yourself. It seems like sbstevenlee 's package also uses pysam in the background to parse the VCF files so that's a good thing in regards to his package.
You could also use bcftools' query command to get a tab delimited output and parse that with python:
Note that this is just the first 3 rows