Filter GQ , DP and Allele Balance in VCF
1
0
Entering edit mode
2.9 years ago

Hi all,

Trying to calculate Allele Balance (it is not already included in the vcf file) and then filter for .2<=AB<=.8 for each heterozygous genotype in all samples as well as GQ and DP for each sample. I am struggling to find the correct syntax for the arithmetic expression AD{0}/DP or AD{1}/DP. Here is what I have so far:

bcftools filter -i 'GQ>=20 & DP>=10 & sum(AD[1:]/DP)>=.2 & sum(AD[1:]/DP)<=.8' input.vcf.gz > output.vcf

The code does not appear to make any changes in the output file.

Any input is appreciated.

Thanks!

bcftools balance allele alllele vcf • 1.9k views
ADD COMMENT
0
Entering edit mode
2.9 years ago
sbstevenlee ▴ 480

Check out the fuc package I wrote:

API solution (Python)

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'G'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT:DP:AD:GQ', 'GT:DP:AD:GQ', 'GT:DP:AD:GQ'],
...     'A': ['0/0:26:0,26:30', '0/1:32:16,16:19', '0/0:.:.:.'],
...     'B': ['./.:.:.:.', '0/0:31:29,2:60', './.:.:.:.'],
...     'C': ['0/1:18:12,6:40', '0/0:24:24,0:12', '1/1:8:0,8:0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO       FORMAT                A               B               C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD:GQ   0/0:26:0,26:30       ./.:.:.:.  0/1:18:12,6:40
1  chr1  101  .   T   C    .      .    .  GT:DP:AD:GQ  0/1:32:16,16:19  0/0:31:29,2:60  0/0:24:24,0:12
2  chr1  102  .   T   G    .      .    .  GT:DP:AD:GQ        0/0:.:.:.       ./.:.:.:.     1/1:8:0,8:0
>>> filtered_vf = vf.markmiss('GQ >= 20 and DP >= 10 and 0.2 <= sum(AD[1:])/DP <= 0.8', opposite=True)
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO       FORMAT          A          B               C
0  chr1  100  .   G   A    .      .    .  GT:DP:AD:GQ  ./.:.:.:.  ./.:.:.:.  0/1:18:12,6:40
1  chr1  101  .   T   C    .      .    .  GT:DP:AD:GQ  ./.:.:.:.  ./.:.:.:.       ./.:.:.:.
2  chr1  102  .   T   G    .      .    .  GT:DP:AD:GQ  ./.:.:.:.  ./.:.:.:.       ./.:.:.:.
>>> # filtered_vf.to_file('filtered.vcf')

CLI solution

$ fuc vcf_filter in.vcf 'GQ >= 20 and DP >= 10 and 0.2 <= sum(AD[1:])/DP <= 0.8' --opposite > out.vcf

For getting help:

$ fuc vcf_filter -h
usage: fuc vcf_filter [-h] [--greedy] [--opposite] [--samples PATH] vcf expr

This command will filter a VCF file (both zipped and unzipped). It essentially
wraps the 'pyvcf.VcfFrame.markmiss' method from the fuc API.

usage examples:
  $ fuc vcf_filter in.vcf 'GT == "0/0"' > out.vcf
  $ fuc vcf_filter in.vcf 'GT != "0/0"' > out.vcf
  $ fuc vcf_filter in.vcf 'DP < 30' > out.vcf
  $ fuc vcf_filter in.vcf 'DP < 30' --greedy > out.vcf
  $ fuc vcf_filter in.vcf 'AD[1] < 10' --greedy > out.vcf
  $ fuc vcf_filter in.vcf 'AD[1] < 10 and DP < 30' --greedy > out.vcf
  $ fuc vcf_filter in.vcf 'AD[1] < 10 or DP < 30' --greedy > out.vcf
  $ fuc vcf_filter in.vcf 'AD[1] < 10 or DP < 30' --opposite > out.vcf
  $ fuc vcf_filter in.vcf 'np.mean(AD) < 10' --greedy --samples sample.list > out.vcf

positional arguments:
  vcf             VCF file
  expr            expression to evaluate

optional arguments:
  -h, --help      show this help message and exit
  --greedy        use this flag to mark even ambiguous genotypes as missing
  --opposite      use this flag to mark all genotypes that do not satisfy the
                  query expression as missing and leave those that do intact
  --samples PATH  file of sample names to apply the marking (one sample per
                  line)
ADD COMMENT

Login before adding your answer.

Traffic: 1767 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6