Need help to read samples from PyVCF
1
1
Entering edit mode
2.8 years ago
anasjamshed ▴ 140

I have a test vcf file and I want to read and extract AD and DP values from the sample column of 2 different populations. I am trying this:

import os
import pandas as pd


def read_vcf(path1):
    with open(path1, 'r') as f:
            lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})



path1 = "C://Users//USER//Desktop//anas/VCFs_1/test_1.vcf"

file =read_vcf(path1)


pop1 = file[["NEN_001","NEN_003","NEN_200","NEN_300","LAB_004","LAB_300","LAB_400","LAB_500"]]

pop2 = file[["ROT_004","ROT_006", "ROT_007","ROT_013" ,"SKN_001" ,"SKN_002","SKN_005","SKN_008"]]

First, I need to extract AD and then DP from pop1 and then I need to divide AD by DP to get AF1 values, after that, I need to do the same process to calculate AF2. After getting both AF1 and AF2 I need to use the AF1 - AF2 formula to calculate the allele frequency difference.

Can anyone guide me?

python VCF • 2.6k views
ADD COMMENT
0
Entering edit mode

I would actually just use the pysam package instead of writing your own parser. The VCF format is very complicated and it is not recommended to parse the information yourself. It seems like sbstevenlee 's package also uses pysam in the background to parse the VCF files so that's a good thing in regards to his package.

You could also use bcftools' query command to get a tab delimited output and parse that with python:

Note that this is just the first 3 rows

bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL[\t%AD\t%DP]\n' -H -s NEN_001,NEN_003 /path/to/vcf.gz | head -n 3 | column -t

\#     [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  [6]QUAL  [7]NEN_001:AD  [8]NEN_001:DP  [9]NEN_003:AD  [10]NEN_003:DP
chr1  10111     .       C       *,A     113.31  55,0,0   55              56,0,0          56
chr1  10114     .       TA      *,CA    34.18   40,0,0   40              55,0,0          55
ADD REPLY
2
Entering edit mode
2.7 years ago
sbstevenlee ▴ 480

Below solution uses the pyvcf submodule from the fuc package I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C'],
...     'ALT': ['A', 'C', 'T', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'GroupA_1': ['1/0', '0/0', '0/0', '0/1'],
...     'GroupA_2': ['1/1', '0/1', '0/0', '0/0'],
...     'GroupB_1': ['0/0', '0/0', '1/1', '0/0'],
...     'GroupB_2': ['0/0', '0/1', '1/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf') # in your case
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT GroupA_1 GroupA_2 GroupB_1 GroupB_2
0  chr1  100  .   G   A    .      .    .     GT      1/0      1/1      0/0      0/0
1  chr1  101  .   T   C    .      .    .     GT      0/0      0/1      0/0      0/1
2  chr1  102  .   A   T    .      .    .     GT      0/0      0/0      1/1      1/1
3  chr1  103  .   C   A    .      .    .     GT      0/1      0/0      0/0      0/0
>>> vf_a = vf.subset(['GroupA_1', 'GroupA_2'])
>>> vf_b = vf.subset(['GroupB_1', 'GroupB_2'])
>>> vf_a = vf_a.compute_info('AF')
>>> vf_b = vf_b.compute_info('AF')
>>> vf_a.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT GroupA_1 GroupA_2
0  chr1  100  .   G   A    .      .  AF=0.750     GT      1/0      1/1
1  chr1  101  .   T   C    .      .  AF=0.250     GT      0/0      0/1
2  chr1  102  .   A   T    .      .  AF=0.000     GT      0/0      0/0
3  chr1  103  .   C   A    .      .  AF=0.250     GT      0/1      0/0
>>> vf_b.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT GroupB_1 GroupB_2
0  chr1  100  .   G   A    .      .  AF=0.000     GT      0/0      0/0
1  chr1  101  .   T   C    .      .  AF=0.250     GT      0/0      0/1
2  chr1  102  .   A   T    .      .  AF=1.000     GT      1/1      1/1
3  chr1  103  .   C   A    .      .  AF=0.000     GT      0/0      0/0
>>> vf_a.extract_info('#AF')
0    0.75
1    0.25
2    0.00
3    0.25
dtype: float64
>>> vf_b.extract_info('#AF')
0    0.00
1    0.25
2    1.00
3    0.00
dtype: float64
>>> vf_a.extract_info('#AF') - vf_b.extract_info('#AF')
0    0.75
1    0.00
2   -1.00
3    0.25
dtype: float64

Important disclaimer: the pyvcf.VcfFrame.compute_info method was recently introduced to the 0.31.0-dev branch; therefore, to use this method, you need to install the development version of fuc:

$ conda create -n fuc -c bioconda fuc
$ conda activate fuc
$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ git checkout 0.31.0-dev
$ pip install .
$ fuc -v

Let me know in the comment section if you have any questions.

ADD COMMENT
0
Entering edit mode

how can I make allele frequency difference plots?

ADD REPLY
0
Entering edit mode

That will depend on how you define an "allele frequency difference plot". What kind of plot do you want? Is is simply a histogram of all the allele frequency differences to show the distribution?

ADD REPLY

Login before adding your answer.

Traffic: 879 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6