Question

how to filter out all the homozygous genotypes in a multi sample VCF file?

0

Entering edit mode

2.8 years ago

Nahid Bandi K • 0

Hi all,

is there way to filter out all the homozygous genotypes in a multi sample VCF file? in my vcf file I have unknown (./.) genotype, reference genotype (0/0) and two alternative genotypes (1/1 and 2/2). I want to remove these genotypes using SnpSift.jar tool.

I have used the following command to remove them:

$ cat SNPs_reheader_annot_passed.vcf.gz | java -jar /home/bandiken/snpEff/SnpSift.jar filter \
"(countHet() > 0)" > SNPs_reheader_annot_Het.vcf.gz

However in the final output file (the following screenshot) along with a few heterozygous genotypes in each row, we can also see the other homozygous genotypes (unwanted genotypes in my study). in the following screenshot you can see that in every location/row there are a few heterozygous genotypes along with multiple homozygous genotypes

so my question is that is it normal?

is there a way to remove the samples with homozygous genotypes without the whole row being removed?

Any help would be appreciated.

Thank you.

vcf • 2.5k views

ADD COMMENT • link 2.7 years ago by Nahid Bandi K • 0

1

Entering edit mode

Well, the genotypes are aligned to the samples in the column and SNPs in the row. If you want to remove them, you'll have to remove all the rows/columns with the 0/1 genotype. So do you mean to subsititute the 0/1 to NA?

ADD REPLY • link 2.8 years ago by Zeng Jingyu ▴ 60

0

Entering edit mode

Thank you for your response. no actually I was wondering if I can remove all the homozygous genotypes to have a vcf file with just heterozygous genotypes or not.

so is it possible to substitute all the homozygous genotypes ( in my case 0/0, 1/1, 2/2 and ./.) with NA? because I do not want any homozygous genotype in my vcf file.

I only want heterozygous genotypes (0/1, 0/2 and1/2) in my vcf file.

Thank you again.

ADD REPLY • link 2.8 years ago by Nahid Bandi K • 0

score 1 · Answer 1 · 2021-07-22

1

Entering edit mode

2.8 years ago

Zeng Jingyu ▴ 60

Well, how about try to replace 0/0, 1/1, 2/2 to ./. ?

./. means no variant, I guess it may work? ./. in some way, is familar to missing or NA.

I don't know what you're doing, but this is the easiest method I can provide.

You can also try Plink, but I think it's not necessary

ADD COMMENT • link 2.8 years ago by Zeng Jingyu ▴ 60

0

Entering edit mode

Thank you zengjingyu and sbstevenlee for your help and explanations. I guess as sbstevenlee explained if I remove all the homozygous genotypes the VCF file format gets corrupted.

ADD REPLY • link 2.7 years ago by Nahid Bandi K • 0

score 1 · Answer 2 · 2021-07-22

I agree with zengjingyu about switching 0/0, 1/1, 2/2 to ./. rather than to NA, because otherwise the resulting file would not be VCF anymore (incorrect GT format). In any case, assuming you have a very good reason to do this regardless, here is a Python API solution using the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['0/1', '1/1', './.'],
...     'B': ['0/0', '2/2', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('input.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0
1  chr1  101  .   T   C    .      .    .     GT  1/1  2/2
2  chr1  102  .   T   A    .      .    .     GT  ./.  0/1
>>> def f(x):
...     a = x.split('/')[0]
...     b = x.split('/')[1]
...     if a == b:
...         return 'NA'
...     else:
...         return x
... 
>>> vf.df.iloc[:, 9:] = vf.df.iloc[:, 9:].applymap(f)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1   NA
1  chr1  101  .   T   C    .      .    .     GT   NA   NA
2  chr1  102  .   T   A    .      .    .     GT   NA  0/1
>>> vf.to_file('technically_not_vcf_anymore.vcf')