Question

BCFtools allele frequency for specific population

0

Entering edit mode

2.1 years ago

Dean • 0

Hi,

Apologies if this is a silly question or has an obvious answer. I am trying to output five VCF files using bcftools filtered by a minimum minor allele frequency (minMAF) of 0.05 within specific population(s) of individuals. I run the command:

bcftools view .../ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  -S .../${POP}file \
  -o ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_${POP} \
  -m2 \
  -M2 \
  -v snps \
  -q 0.05:minor \
  -O z

Does this command mean that only SNPs with a minMAF of 0.05 within the individuals I specify in -S will be kept or it is minMAF across all individuals in the input VCF? Ideally, I would like to achieve the former.

Thank you

bcftools allele-frequency • 3.4k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 2.1 years ago by Dean • 0

score 0 · Answer 1 · 2022-03-16

0

Entering edit mode

2.1 years ago

raphael.B ▴ 520

By doing so, you'll be filtering on the MAF computed on the whole VCF. To get the MAF corresponding to your subpopulation you can use the bcftools +fill-tags plugin, with a command like the following one (not tested):

bcftools view .../ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz -S .../${POP}file  -m2 -M2 -v snps -Ou |bcftools +fill-tags -t AF -Ou | bcftools view -Oz -q 0.05:minor -o ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_${POP}

ADD COMMENT • link 2.1 years ago by raphael.B ▴ 520

0

Entering edit mode

Hi Raphael,

Thank you very much for this solution, I will test and reply again if there are issues.

Dean

ADD REPLY • link 2.1 years ago by Dean • 0

score 0 · Answer 2 · 2022-03-17

If you don't mind a Python solution, below is one using the pyvcf submodule from the fuc package I wrote.

Let's say your VCF contains samples from two populations A and B.

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A1': ['0/1', '0/1', '0/1'],
...     'A2': ['0/0', '1/1', '0/0'],
...     'A3': ['0/1', '0/1', '0/0'],
...     'B1': ['0/0', '0/1', '1/1'],
...     'B2': ['0/0', '1/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT   A1   A2   A3   B1   B2
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/1  0/1  1/1
2  chr1  102  .   T   A    .      .    .     GT  0/1  0/0  0/0  1/1  0/1

Identify variants with AF >= 0.2 in population A.

>>> A_vf = vf.subset(['A1', 'A2', 'A3'])
>>> A_vf = A_vf.compute_info('AF')
>>> A_vf.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT   A1   A2   A3
0  chr1  100  .   G   A    .      .  AF=0.333     GT  0/1  0/0  0/1
1  chr1  101  .   T   C    .      .  AF=0.667     GT  0/1  1/1  0/1
2  chr1  102  .   T   A    .      .  AF=0.167     GT  0/1  0/0  0/0

Actually apply the filtering.

>>> i = A_vf.extract_info('#AF') > 0.2
>>> filtered_vf = pyvcf.VcfFrame(vf.copy_meta(), vf.df[i])
>>> filtered_vf.df

  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT   A1   A2   A3   B1   B2
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/1  0/1  1/1

Optionally write output VCF.

# filtered_vf.to_file('out.vcf')

Let me know if you have any questions.