BCFtools allele frequency for specific population
2
0
Entering edit mode
2.1 years ago
Dean • 0

Hi,

Apologies if this is a silly question or has an obvious answer. I am trying to output five VCF files using bcftools filtered by a minimum minor allele frequency (minMAF) of 0.05 within specific population(s) of individuals. I run the command:

bcftools view .../ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  -S .../${POP}file \
  -o ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_${POP} \
  -m2 \
  -M2 \
  -v snps \
  -q 0.05:minor \
  -O z

Does this command mean that only SNPs with a minMAF of 0.05 within the individuals I specify in -S will be kept or it is minMAF across all individuals in the input VCF? Ideally, I would like to achieve the former.

Thank you

bcftools allele-frequency • 3.3k views
ADD COMMENT
0
Entering edit mode
2.1 years ago
raphael.B ▴ 520

By doing so, you'll be filtering on the MAF computed on the whole VCF. To get the MAF corresponding to your subpopulation you can use the bcftools +fill-tags plugin, with a command like the following one (not tested):

bcftools view .../ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz -S .../${POP}file  -m2 -M2 -v snps -Ou |bcftools +fill-tags -t AF -Ou | bcftools view -Oz -q 0.05:minor -o ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_${POP}
ADD COMMENT
0
Entering edit mode

Hi Raphael,

Thank you very much for this solution, I will test and reply again if there are issues.

Dean

ADD REPLY
0
Entering edit mode
2.1 years ago
sbstevenlee ▴ 480

If you don't mind a Python solution, below is one using the pyvcf submodule from the fuc package I wrote.

Let's say your VCF contains samples from two populations A and B.

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A1': ['0/1', '0/1', '0/1'],
...     'A2': ['0/0', '1/1', '0/0'],
...     'A3': ['0/1', '0/1', '0/0'],
...     'B1': ['0/0', '0/1', '1/1'],
...     'B2': ['0/0', '1/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT   A1   A2   A3   B1   B2
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/1  0/1  1/1
2  chr1  102  .   T   A    .      .    .     GT  0/1  0/0  0/0  1/1  0/1

Identify variants with AF >= 0.2 in population A.

>>> A_vf = vf.subset(['A1', 'A2', 'A3'])
>>> A_vf = A_vf.compute_info('AF')
>>> A_vf.df
  CHROM  POS ID REF ALT QUAL FILTER      INFO FORMAT   A1   A2   A3
0  chr1  100  .   G   A    .      .  AF=0.333     GT  0/1  0/0  0/1
1  chr1  101  .   T   C    .      .  AF=0.667     GT  0/1  1/1  0/1
2  chr1  102  .   T   A    .      .  AF=0.167     GT  0/1  0/0  0/0

Actually apply the filtering.

>>> i = A_vf.extract_info('#AF') > 0.2
>>> filtered_vf = pyvcf.VcfFrame(vf.copy_meta(), vf.df[i])
>>> filtered_vf.df

  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT   A1   A2   A3   B1   B2
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0  0/1  0/0  0/0
1  chr1  101  .   T   C    .      .    .     GT  0/1  1/1  0/1  0/1  1/1

Optionally write output VCF.

# filtered_vf.to_file('out.vcf')

Let me know if you have any questions.

ADD COMMENT
0
Entering edit mode

Hi,

Thank you for your solution.

ADD REPLY

Login before adding your answer.

Traffic: 2362 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6