how to filter out all the homozygous genotypes in a multi sample VCF file?
2
0
Entering edit mode
9 days ago

Hi all,

is there way to filter out all the homozygous genotypes in a multi sample VCF file? in my vcf file I have unknown (./.) genotype, reference genotype (0/0) and two alternative genotypes (1/1 and 2/2). I want to remove these genotypes using SnpSift.jar tool.

I have used the following command to remove them:

\$ cat SNPs_reheader_annot_passed.vcf.gz | java -jar /home/bandiken/snpEff/SnpSift.jar filter \


However in the final output file (the following screenshot) along with a few heterozygous genotypes in each row, we can also see the other homozygous genotypes (unwanted genotypes in my study).

so my question is that is it normal?

is there a way to remove the samples with homozygous genotypes without the whole row being removed?

Any help would be appreciated.

Thank you.

vcf • 362 views
1
Entering edit mode

Well, the genotypes are aligned to the samples in the column and SNPs in the row. If you want to remove them, you'll have to remove all the rows/columns with the 0/1 genotype. So do you mean to subsititute the 0/1 to NA?

0
Entering edit mode

Thank you for your response. no actually I was wondering if I can remove all the homozygous genotypes to have a vcf file with just heterozygous genotypes or not.

so is it possible to substitute all the homozygous genotypes ( in my case 0/0, 1/1, 2/2 and ./.) with NA? because I do not want any homozygous genotype in my vcf file.

I only want heterozygous genotypes (0/1, 0/2 and1/2) in my vcf file.

Thank you again.

1
Entering edit mode
6 days ago
sbstevenlee ▴ 140

I agree with zengjingyu about switching 0/0, 1/1, 2/2 to ./. rather than to NA, because otherwise the resulting file would not be VCF anymore (incorrect GT format). In any case, assuming you have a very good reason to do this regardless, here is a Python API solution using the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'T'],
...     'ALT': ['A', 'C', 'A'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'A': ['0/1', '1/1', './.'],
...     'B': ['0/0', '2/2', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('input.vcf')
>>> vf.df
CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/0
1  chr1  101  .   T   C    .      .    .     GT  1/1  2/2
2  chr1  102  .   T   A    .      .    .     GT  ./.  0/1
>>> def f(x):
...     a = x.split('/')[0]
...     b = x.split('/')[1]
...     if a == b:
...         return 'NA'
...     else:
...         return x
...
>>> vf.df.iloc[:, 9:] = vf.df.iloc[:, 9:].applymap(f)
>>> vf.df
CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1   NA
1  chr1  101  .   T   C    .      .    .     GT   NA   NA
2  chr1  102  .   T   A    .      .    .     GT   NA  0/1
>>> vf.to_file('technically_not_vcf_anymore.vcf')

0
Entering edit mode

Nice method! Nahid can also try this to replace homozygous genotypes to ./.

0
Entering edit mode
6 days ago
zengjingyu ▴ 10

Well, how about try to replace 0/0, 1/1, 2/2 to ./. ?

./. means no variant, I guess it may work? ./. in some way, is familar to missing or NA.

I don't know what you're doing, but this is the easiest method I can provide.

You can also try Plink, but I think it's not necessary

Traffic: 1725 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.