Question: Opposite genotype filtering in multisample vcf for 2 population
0
4.9 years ago by
ck7912340
United Kingdom
ck7912340 wrote:

Hi,

I'm trying to filter for opposite genotype from multisample vcf (8 samples) which consist of 2 population (4 samples/population) while maintain the vcf format.

What i want to achieved are as follow:

1. Filter for same genotype in all sample in Population A while genotype in Population B are totally different than Population A.

example: 1/1 in all sample in Population A (s1-s4), none 1/1 in Population B (a1-a4) (genotype in Population B can be a mixture of 0/0 and 0/1).

#desire genotype output

s1      s2     s3    s4     a1    a2     a3     a4

1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/0

1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1

and so on..

2. Flip the genotype filtering as in (1) where genotype in Population B are same while genotype in Population A is not same as Population B.

#desire genotype output

s1      s2     s3    s4     a1    a2     a3     a4

0/0    0/0    0/0    0/0    1/1    1/1    1/1    1/1

0/0    0/0    0/0    0/1    1/1    1/1    1/1    1/1

and so on..

I tried with SnpSift, however the filter missed 26 combination when involve 1/1 genotype.

The command i used as follow:

cat sample.vcf | java -Xmx4g -jar  SnpSift.jar filter  "(((countHom()>3)&(countHom()<5))|(countHet()=4)|(countRef()=4)) &((isRef(GEN[0]) & isRef(GEN[1]) & isRef(GEN[2]) &  isRef(GEN[3]))|(isHom(GEN[0]) & isVariant(GEN[0]) & isHom(GEN[1])& isVariant(GEN[1]) & isHom(GEN[2]) & isVariant(GEN[2])& isHom(GEN[3]) & isVariant(GEN[3])) | (isHet(GEN[0]) & isHet(GEN[1]) & isHet(GEN[2]) & isHet(GEN[3]))| (isRef(GEN[4]) & isRef(GEN[5]) & isRef(GEN[6]) &  isRef(GEN[7]))|(isHom(GEN[4]) & isVariant(GEN[4]) & isHom(GEN[5])& isVariant(GEN[5]) & isHom(GEN[6]) & isVariant(GEN[6])& isHom(GEN[7]) & isVariant(GEN[7])) | (isHet(GEN[4]) & isHet(GEN[5]) & isHet(GEN[6]) & isHet(GEN[7])))"

Does anyone has experience in this filtering?  I  appreciate any help/advise.

Many thanks

snp genotype snpsift vcf • 1.6k views
modified 4.9 years ago • written 4.9 years ago by ck7912340

Thanks for all the helps & advise.

Script for the opposite filtering with https://github.com/lindenb/jvarkit/wiki/VCFFilterJS:

``````function accept(v)
{
var pop1=["s1","s2","s3","s4"];
var pop2=["a1","a2","a3","a4"];
var i;
var g1 = v.getGenotype(pop1[0]);
var g2 = v.getGenotype(pop2[0]);
var g1issame = 1;

if( !g1.isCalled() ) return false;
if( !g2.isCalled() ) return false;

for(i=1;i < pop1.length; ++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) g1issame = 0;
}

if( g1issame == 1 )
{
for(i=0;i < pop2.length; ++i)
{
if( g1.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
}
else
{
for(i=1;i < pop2.length;++i)
{
if( ! g2.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
for(i=0;i< pop1.length; ++i)
{
if( g2.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
}

return true;
}
accept(variant);
``````
1
4.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

using my tool https://github.com/lindenb/jvarkit/wiki/VCFFilterJS and the following script:

``````function accept(v)
{
var pop1=["sample1","sample2","sample3"];
var pop2=["sample4","sample5"];
var i,j;
var g1 = v.getGenotype(pop1[0]);
var g2 = v.getGenotype(pop2[0]);

if( !g1.isCalled() ) return false;
if( !g2.isCalled() ) return false;
if( g1.sameGenotype(g2) ) return false;

for(i=1;i < pop1.length;++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
for(i=1;i < pop2.length;++i)
{
if( ! g2.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
return true;
}

accept(variant);
``````

you can then reformat the resulting vcf using my tool bioalcidae https://github.com/lindenb/jvarkit/wiki/BioAlcidae

Hi Pierre,

Thanks for the help.

The script still missed condition where 2 opposite genotype are in Population B(a1-a4) and vice versa like as exampled as below which I'm interested to keep (even after I set `pop1=["s1","s2","s3","s4"]; pop2=["a1","a2","a3","a4"]`):

``````s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1
1/1    1/1    1/1    1/1    0/0    0/1    0/0    0/1
``````

and so on..

Also, what is the `j` refer to in the script (`var i,j`)?

1

The script still missed condition where 2 opposite genotype

oh I see, It would be something like

``````function accept(v)
{
var pop1=["sample1","sample2","sample3"];
var pop2=["sample4","sample5"];
var i;
var g1 = v.getGenotype(pop1[0]);

if( !g1.isCalled() ) return false;

for(i=1;i < pop1.length;++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
for(i=0;i < pop2.length;++i)
{
if( g1.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
return true;
}

accept(variant);
``````

The script worked brilliantly to filter for fix genotype in 1 population (eg. 0/0) and opposite mix genotype (0/1 or 1/1 and mix of 0/1 & 1/1) in other population.

My ultimate aim is to have script that will output following within single script :

``````s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1
1/1    1/1    1/1    1/1    0/1    0/0    0/0    0/1
0/0    0/0    0/0    0/1     1/1    1/1    1/1    1/1
0/0    0/1    0/0    0/1     1/1    1/1    1/1    1/1
0/1    0/1    0/1    0/1     1/1    1/1    1/1    0/0
1/1    1/1    1/1    0/0      0/1    0/1    0/1    0/1
``````

and so on ..

I've tried to modified the script but failed. Can you advise?

Thanks

j is unused. you can remove it.

0
4.9 years ago by
DG7.2k
DG7.2k wrote:

Have you checked out GEMINI? It does a little more than you you want in that it also adds all sorts of annotations to your dataset as well, but it stores the VCFs (and can include a PED file) in an SQLite database. It has some built in search tools where I think you could set up the right queries to get what you want. Since you can separate your populations as defined groups or cohorts in the database.

Thanks Dan for the info,

I work with non-human and GEMINI is only for human.

No problem, there may be a way to "hack" it to work since you aren't really interested in the annotations anyway, just the genotype storage and selection. But it may be easier to take another approach.

0
4.9 years ago by
DG7.2k
DG7.2k wrote:

Looking around I think the tool you probably want to use is vcf-contrast from the VCFtools tool set. It is specifically designed for looking for differences between groups in this manner.

vcf-constrast still missed condition where 2 opposite genotype are present in Pop.B and vice versa as following:

``````s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1
``````