Filter on Allele Balance using BCFTools
1
0
Entering edit mode
2.7 years ago

Hi All,

I need to filter my variants based on the following criteria.

1) Include SNP sites with at least one heterozygous with allele balance(AB) > 0.15 or at least one homozygous variant

2) Include INDEL sites with at least one heterozygous with allele balance(AB) > 0.20 or at least one homozygous variant

This is the bcftools command I am trying to use. Allele Balance was taken as: No of reads with Alt allele/Total no of reads

bcftools view --threads 10 -i '(TYPE="SNP" && N_PASS(GT="het" & FMT/AD[*:1]/(FMT/AD[*:0]+FMT/AD[*:1])>=0.15) >= 1) || (TYPE="SNP" && COUNT(GT="AA") >= 1) || (TYPE="INDEL" && N_PASS(GT="het" & FMT/AD[*:1]/(FMT/AD[*:0]+FMT/AD[*:1])>=0.20) >= 1) || (TYPE="INDEL" && COUNT(GT="AA") >= 1)' -Oz -o $in/fil1/input.vcf.gz

Just want to confirm what I am doing is correct. Appreciate your feedback.

Thank you Best, Sumudu

bcftools allele filter balance • 2.2k views
ADD COMMENT
0
Entering edit mode

I am wondering why we set AB>0.15 without AB<0.85?? Thanks

ADD REPLY
1
Entering edit mode
2.7 years ago
sbstevenlee ▴ 480

Here's another solution based on Python API using the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 110, 120, 130, 140],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'AT', 'G', 'T'],
...     'ALT': ['A', 'C', 'A', 'GT', 'TT'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT:AD', 'GT:AD', 'GT:AD', 'GT:AD', 'GT:AD'],
...     'A': ['0/1:84,16', '0/1:90,10', '1/1:2,98', '0/1:79,21', '0/0:100,0'],
...     'B': ['0/1:95,5', '0/1:92,8', '0/0:100,0', '0/0:100,0', '0/1:80,20']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT          A          B
0  chr1  100  .   G   A    .      .    .  GT:AD  0/1:84,16   0/1:95,5
1  chr1  110  .   T   C    .      .    .  GT:AD  0/1:90,10   0/1:92,8
2  chr1  120  .  AT   A    .      .    .  GT:AD   1/1:2,98  0/0:100,0
3  chr1  130  .   G  GT    .      .    .  GT:AD  0/1:79,21  0/0:100,0
4  chr1  140  .   T  TT    .      .    .  GT:AD  0/0:100,0  0/1:80,20
>>>
>>> def one_row(r):
...     ad_index = r.FORMAT.split(':').index('AD')
...     
...     def one_gt(g, threshold):
...         ad_list = [int(x) for x in g.split(':')[ad_index].split(',')]
...         ref = ad_list[0]
...         alt = ad_list[1]
...         return alt / (ref + alt) > threshold
...     
...     if pyvcf.row_hasindel(r):
...         s = r[9:].apply(one_gt, args=(0.2,))
...     else:
...         s = r[9:].apply(one_gt, args=(0.15,))
...         
...     return s.any()
... 
>>> i = vf.df.apply(one_row, axis=1)
>>> 
>>> filtered_vf = pyvcf.VcfFrame(vf.copy_meta(), vf.df[i])
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT          A          B
0  chr1  100  .   G   A    .      .    .  GT:AD  0/1:84,16   0/1:95,5
1  chr1  120  .  AT   A    .      .    .  GT:AD   1/1:2,98  0/0:100,0
2  chr1  130  .   G  GT    .      .    .  GT:AD  0/1:79,21  0/0:100,0
>>>
>>> # filtered_vf.to_file('filtered.vcf')
ADD COMMENT

Login before adding your answer.

Traffic: 2707 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6