Filter vcf file based on the GTs of a single sample
2
0
Entering edit mode
2.7 years ago

Hi, I have a multisample vcf, and I am trying to keep only those variants that are biallelic and homozygous in one of the samples.How can I achieve this and also keep the GTs for the other samples at the position in which this condition is true.

bcftools vcf samtools bcf • 773 views
ADD COMMENT
1
Entering edit mode
2.7 years ago

use GATK SelectVariants https://gatk.broadinstitute.org/hc/en-us/articles/360037055952-SelectVariants with a JEXL expression like

  -select 'vc.getGenotype("NA12878").isHomVar()'
ADD COMMENT
1
Entering edit mode
2.7 years ago
sbstevenlee ▴ 480

Here's a Python API solution using the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C', 'T'],
...     'ALT': ['C,T', 'T', 'G', 'G,A', 'A'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/2', '0/0', '0/1', './.', '0/1'],
...     'B': ['0/1', '1/1', './.', '1/2', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  101  .   A    T    .      .    .     GT  0/0  1/1
2  chr1  102  .   C    G    .      .    .     GT  0/1  ./.
3  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2
4  chr1  104  .   T    A    .      .    .     GT  0/1  1/1
>>> # Remove multiallelic variants
>>> vf = vf.filter_multialt()
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  1/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.
2  chr1  104  .   T   A    .      .    .     GT  0/1  1/1
>>> # Select only variants where the sample B is homozygous
>>> def one_row(r):
...     return r['B'].split('/')[0] == r['B'].split('/')[1]
... 
>>> i = vf.df.apply(one_row, axis=1)
>>> vf.df = vf.df[i]
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  1/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.
2  chr1  104  .   T   A    .      .    .     GT  0/1  1/1
>>> vf.to_file('out.vcf')
ADD COMMENT

Login before adding your answer.

Traffic: 2523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6