Opposite genotype filtering in multisample vcf for 2 population
3
0
Entering edit mode
6.3 years ago
ck791234 • 0

Hi,

I'm trying to filter for opposite genotype from multisample vcf (8 samples) which consist of 2 population (4 samples/population) while maintain the vcf format.

What i want to achieved are as follow:

1. Filter for same genotype in all sample in Population A while genotype in Population B are totally different than Population A.

example: 1/1 in all sample in Population A (s1-s4), none 1/1 in Population B (a1-a4) (genotype in Population B can be a mixture of 0/0 and 0/1).

#desire genotype output

s1      s2     s3    s4     a1    a2     a3     a4

1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/0

1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1

and so on..

2. Flip the genotype filtering as in (1) where genotype in Population B are same while genotype in Population A is not same as Population B.

#desire genotype output

s1      s2     s3    s4     a1    a2     a3     a4

0/0    0/0    0/0    0/0    1/1    1/1    1/1    1/1

0/0    0/0    0/0    0/1    1/1    1/1    1/1    1/1

and so on..

I tried with SnpSift, however the filter missed 26 combination when involve 1/1 genotype.

The command i used as follow:

cat sample.vcf | java -Xmx4g -jar  SnpSift.jar filter  "(((countHom()>3)&(countHom()<5))|(countHet()=4)|(countRef()=4)) &((isRef(GEN[0]) & isRef(GEN[1]) & isRef(GEN[2]) &  isRef(GEN[3]))|(isHom(GEN[0]) & isVariant(GEN[0]) & isHom(GEN[1])& isVariant(GEN[1]) & isHom(GEN[2]) & isVariant(GEN[2])& isHom(GEN[3]) & isVariant(GEN[3])) | (isHet(GEN[0]) & isHet(GEN[1]) & isHet(GEN[2]) & isHet(GEN[3]))| (isRef(GEN[4]) & isRef(GEN[5]) & isRef(GEN[6]) &  isRef(GEN[7]))|(isHom(GEN[4]) & isVariant(GEN[4]) & isHom(GEN[5])& isVariant(GEN[5]) & isHom(GEN[6]) & isVariant(GEN[6])& isHom(GEN[7]) & isVariant(GEN[7])) | (isHet(GEN[4]) & isHet(GEN[5]) & isHet(GEN[6]) & isHet(GEN[7])))"

Does anyone has experience in this filtering?  I  appreciate any help/advise.

Many thanks

genotype vcf SNP snpsift • 2.1k views
0
Entering edit mode

Thanks for all the helps & advise.

Script for the opposite filtering with https://github.com/lindenb/jvarkit/wiki/VCFFilterJS:

function accept(v)
{
var pop1=["s1","s2","s3","s4"];
var pop2=["a1","a2","a3","a4"];
var i;
var g1 = v.getGenotype(pop1[0]);
var g2 = v.getGenotype(pop2[0]);
var g1issame = 1;

if( !g1.isCalled() ) return false;
if( !g2.isCalled() ) return false;

for(i=1;i < pop1.length; ++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) g1issame = 0;
}

if( g1issame == 1 )
{
for(i=0;i < pop2.length; ++i)
{
if( g1.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
}
else
{
for(i=1;i < pop2.length;++i)
{
if( ! g2.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
for(i=0;i< pop1.length; ++i)
{
if( g2.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
}

return true;
}
accept(variant);

1
Entering edit mode
6.3 years ago

using my tool https://github.com/lindenb/jvarkit/wiki/VCFFilterJS and the following script:

function accept(v)
{
var pop1=["sample1","sample2","sample3"];
var pop2=["sample4","sample5"];
var i,j;
var g1 = v.getGenotype(pop1[0]);
var g2 = v.getGenotype(pop2[0]);

if( !g1.isCalled() ) return false;
if( !g2.isCalled() ) return false;
if( g1.sameGenotype(g2) ) return false;

for(i=1;i < pop1.length;++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
for(i=1;i < pop2.length;++i)
{
if( ! g2.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
return true;
}

accept(variant);


you can then reformat the resulting vcf using my tool bioalcidae https://github.com/lindenb/jvarkit/wiki/BioAlcidae

0
Entering edit mode

Hi Pierre,

Thanks for the help.

The script still missed condition where 2 opposite genotype are in Population B(a1-a4) and vice versa like as exampled as below which I'm interested to keep (even after I set pop1=["s1","s2","s3","s4"]; pop2=["a1","a2","a3","a4"]):

s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1
1/1    1/1    1/1    1/1    0/0    0/1    0/0    0/1


and so on..

Also, what is the j refer to in the script (var i,j)?

1
Entering edit mode

The script still missed condition where 2 opposite genotype

oh I see, It would be something like

function accept(v)
{
var pop1=["sample1","sample2","sample3"];
var pop2=["sample4","sample5"];
var i;
var g1 = v.getGenotype(pop1[0]);

if( !g1.isCalled() ) return false;

for(i=1;i < pop1.length;++i)
{
if( ! g1.sameGenotype( v.getGenotype(pop1[i]) ) ) return false;
}
for(i=0;i < pop2.length;++i)
{
if( g1.sameGenotype( v.getGenotype(pop2[i]) ) ) return false;
}
return true;
}

accept(variant);

0
Entering edit mode

The script worked brilliantly to filter for fix genotype in 1 population (eg. 0/0) and opposite mix genotype (0/1 or 1/1 and mix of 0/1 & 1/1) in other population.

My ultimate aim is to have script that will output following within single script :

s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1
1/1    1/1    1/1    1/1    0/1    0/0    0/0    0/1
0/0    0/0    0/0    0/1     1/1    1/1    1/1    1/1
0/0    0/1    0/0    0/1     1/1    1/1    1/1    1/1
0/1    0/1    0/1    0/1     1/1    1/1    1/1    0/0
1/1    1/1    1/1    0/0      0/1    0/1    0/1    0/1


and so on ..

I've tried to modified the script but failed. Can you advise?

Thanks

0
Entering edit mode

j is unused. you can remove it.

0
Entering edit mode
6.3 years ago
DG 7.2k

Have you checked out GEMINI? It does a little more than you you want in that it also adds all sorts of annotations to your dataset as well, but it stores the VCFs (and can include a PED file) in an SQLite database. It has some built in search tools where I think you could set up the right queries to get what you want. Since you can separate your populations as defined groups or cohorts in the database.

0
Entering edit mode

Thanks Dan for the info,

I work with non-human and GEMINI is only for human.

0
Entering edit mode

No problem, there may be a way to "hack" it to work since you aren't really interested in the annotations anyway, just the genotype storage and selection. But it may be easier to take another approach.

0
Entering edit mode
6.3 years ago
DG 7.2k

Looking around I think the tool you probably want to use is vcf-contrast from the VCFtools tool set. It is specifically designed for looking for differences between groups in this manner.

0
Entering edit mode

vcf-constrast still missed condition where 2 opposite genotype are present in Pop.B and vice versa as following:

s1      s2     s3    s4     a1    a2     a3     a4
1/1    1/1    1/1    1/1    0/0    0/0    0/0    0/1