Question

Need help to read samples from PyVCF

1

Entering edit mode

2.8 years ago

anasjamshed ▴ 140

I have a test vcf file and I want to read and extract AD and DP values from the sample column of 2 different populations. I am trying this:

import os
import pandas as pd


def read_vcf(path1):
    with open(path1, 'r') as f:
            lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})



path1 = "C://Users//USER//Desktop//anas/VCFs_1/test_1.vcf"

file =read_vcf(path1)


pop1 = file[["NEN_001","NEN_003","NEN_200","NEN_300","LAB_004","LAB_300","LAB_400","LAB_500"]]

pop2 = file[["ROT_004","ROT_006", "ROT_007","ROT_013" ,"SKN_001" ,"SKN_002","SKN_005","SKN_008"]]

First, I need to extract AD and then DP from pop1 and then I need to divide AD by DP to get AF1 values, after that, I need to do the same process to calculate AF2. After getting both AF1 and AF2 I need to use the AF1 - AF2 formula to calculate the allele frequency difference.

Can anyone guide me?

python VCF • 2.6k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 2.8 years ago by anasjamshed ▴ 140

0

Entering edit mode

I would actually just use the pysam package instead of writing your own parser. The VCF format is very complicated and it is not recommended to parse the information yourself. It seems like sbstevenlee 's package also uses pysam in the background to parse the VCF files so that's a good thing in regards to his package.

You could also use bcftools' query command to get a tab delimited output and parse that with python:

Note that this is just the first 3 rows

bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL[\t%AD\t%DP]\n' -H -s NEN_001,NEN_003 /path/to/vcf.gz | head -n 3 | column -t

\#     [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  [6]QUAL  [7]NEN_001:AD  [8]NEN_001:DP  [9]NEN_003:AD  [10]NEN_003:DP
chr1  10111     .       C       *,A     113.31  55,0,0   55              56,0,0          56
chr1  10114     .       TA      *,CA    34.18   40,0,0   40              55,0,0          55

ADD REPLY • link 2.7 years ago by Robert ▴ 10

score 2 · Answer 1 · 2022-02-16

Below solution uses the pyvcf submodule from the fuc package I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'GroupA_1': ['1/0', '0/0', '0/0', '0/1'],
...     'GroupA_2': ['1/1', '0/1', '0/0', '0/0'],
...     'GroupB_1': ['0/0', '0/0', '1/1', '0/0'],
...     'GroupB_2': ['0/0', '0/1', '1/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf') # in your case
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT GroupA_1 GroupA_2 GroupB_1 GroupB_2
0  chr1  100  .   G   A    .      .    .     GT      1/0      1/1      0/0      0/0
1  chr1  101  .   T   C    .      .    .     GT      0/0      0/1      0/0      0/1
2  chr1  102  .   A   T    .      .    .     GT      0/0      0/0      1/1      1/1
3  chr1  103  .   C   A    .      .    .     GT      0/1      0/0      0/0      0/0
>>> vf_a = vf.subset(['GroupA_1', 'GroupA_2'])
>>> vf_b = vf.subset(['GroupB_1', 'GroupB_2'])
>>> vf_a = vf_a.compute_info('AF')
>>> vf_b = vf_b.compute_info('AF')
>>> vf_a.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT GroupA_1 GroupA_2
0  chr1  100  .   G   A    .      .  AF=0.750     GT      1/0      1/1
1  chr1  101  .   T   C    .      .  AF=0.250     GT      0/0      0/1
2  chr1  102  .   A   T    .      .  AF=0.000     GT      0/0      0/0
3  chr1  103  .   C   A    .      .  AF=0.250     GT      0/1      0/0
>>> vf_b.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT GroupB_1 GroupB_2
0  chr1  100  .   G   A    .      .  AF=0.000     GT      0/0      0/0
1  chr1  101  .   T   C    .      .  AF=0.250     GT      0/0      0/1
2  chr1  102  .   A   T    .      .  AF=1.000     GT      1/1      1/1
3  chr1  103  .   C   A    .      .  AF=0.000     GT      0/0      0/0
>>> vf_a.extract_info('#AF')
0    0.75
1    0.25
2    0.00
3    0.25
dtype: float64
>>> vf_b.extract_info('#AF')
0    0.00
1    0.25
2    1.00
3    0.00
dtype: float64
>>> vf_a.extract_info('#AF') - vf_b.extract_info('#AF')
0    0.75
1    0.00
2   -1.00
3    0.25
dtype: float64

Important disclaimer: the pyvcf.VcfFrame.compute_info method was recently introduced to the 0.31.0-dev branch; therefore, to use this method, you need to install the development version of fuc:

$ conda create -n fuc -c bioconda fuc
$ conda activate fuc
$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ git checkout 0.31.0-dev
$ pip install .
$ fuc -v

Let me know in the comment section if you have any questions.