Question: Regenotype VCF using some samples as reference
19 months ago, rmf wrote:

I have a VCF file with 24 control samples and 24 treated samples, all called jointly with GATK. I am not interested in the differences between my samples and the Ensembl reference used for mapping. I am interested in control vs treated. So I would like to use my control samples as the reference and regenotype my treated samples against it.

To explain further, here is a simplified example:

var ref alt t1   t2   t3   c1   c2   c3
1   A   T   A/T  A/T  A/T  A/T  A/T  A/T
2   A   T   A/T  A/T  A/T  A/A  A/A  A/A
3   G   C   G/G  G/G  G/G  G/C  G/C  G/C

var1 would not be interesting as it does not differ between controls and treated. var2 and var3 are interesting.
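The distinction drawn above can be sketched in a few lines of Python. This is only an illustration on the toy table: the `GENOTYPES` dictionary and the helper names are made up for the example, and comparing the sets of genotypes seen in each group is a crude criterion that would need replacing with a frequency-based test on real data.

```python
# Toy genotypes copied from the table above.
# Assumption: sample names starting with "t" are treated, "c" are controls.
GENOTYPES = {
    "1": {"t1": "A/T", "t2": "A/T", "t3": "A/T", "c1": "A/T", "c2": "A/T", "c3": "A/T"},
    "2": {"t1": "A/T", "t2": "A/T", "t3": "A/T", "c1": "A/A", "c2": "A/A", "c3": "A/A"},
    "3": {"t1": "G/G", "t2": "G/G", "t3": "G/G", "c1": "G/C", "c2": "G/C", "c3": "G/C"},
}


def normalise(gt):
    """Treat genotypes as unordered pairs, so A/T equals T/A."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def differs_between_groups(calls):
    """True when the genotypes seen in treated samples differ from those in controls."""
    treated = {normalise(g) for s, g in calls.items() if s.startswith("t")}
    controls = {normalise(g) for s, g in calls.items() if s.startswith("c")}
    return treated != controls


interesting = [var for var, calls in GENOTYPES.items() if differs_between_groups(calls)]
print(interesting)  # ['2', '3']
```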

I am thinking of an approach where I would pick variants that have identical genotypes across controls and find a consensus, then use that as the reference and regenotype my treated samples against it. That brings us to an interesting question: which allele do I pick as the reference for heterozygous positions? I am not sure how that is done for the reference genomes...

In this example, I am just going to pick the most common allele for each variant and set that as the reference. var1 was skipped earlier and the new reference looks like:

var ref
2   A
3   G
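One way to read the "most common allele" rule is sketched below in Python. Two things here are assumptions, not something stated in the post: alleles are counted among the control samples only, and a tie (as at the heterozygous var 3, where G and C each appear three times) is broken in favour of the original REF allele. The `CONTROL_CALLS` structure and the function name are hypothetical.

```python
from collections import Counter

# Control genotypes for the two remaining variants, with the original REF allele.
CONTROL_CALLS = {
    "2": {"ref": "A", "calls": ["A/A", "A/A", "A/A"]},
    "3": {"ref": "G", "calls": ["G/C", "G/C", "G/C"]},
}


def consensus_allele(ref, calls):
    """Most common allele among the calls; ties fall back to the original REF."""
    counts = Counter(allele for gt in calls for allele in gt.split("/"))
    top = max(counts.values())
    tied = [allele for allele, n in counts.items() if n == top]
    # Assumption: prefer the original REF on a tie, else the alphabetically first allele.
    return ref if ref in tied else sorted(tied)[0]


new_reference = {
    var: consensus_allele(d["ref"], d["calls"]) for var, d in CONTROL_CALLS.items()
}
print(new_reference)  # {'2': 'A', '3': 'G'}
```

With a different tie-break, var 3 could just as well come out as C, which is exactly why heterozygous control positions make the choice of a "control reference" arbitrary.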

Now if we regenotype the treated samples against the new reference, we get:

var ref alt c1   c2   c3
2   A   A   A/A  A/A  A/A
3   G   C   G/C  G/C  G/C

var 2 can be skipped because it is not polymorphic anymore (only because A was chosen as ref). Then we have:

var ref alt c1   c2   c3
3   G   C   G/C  G/C  G/C

Is this correct? Is anything like this implemented in any workflow or software? Ultimately, the aim is to pick only the differences caused by the experimental condition.

Tags: snp, variant-calling, vcf

It is not clear to me: what do you want to do with your t* samples?

written 19 months ago by Pierre Lindenbaum

From the original post:

I would pick variants that have identical genotypes across controls and find a consensus. Then use that as the reference and regenotype my treated samples against it.

written 19 months ago by genomax

Well, the point is that I have 3 variants when considering all samples (top code block), but I have only 1 variant if I regenotype my T against C (bottom code block). I would expect this to make a big difference in downstream variant effect predictions and so on.

written 19 months ago by rmf
19 months ago, rmf wrote:

I used PLINK to do the association between control/treated and find significant sites. I then filtered my VCF based on those sites.
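For readers who want to see what a per-site control/treated association amounts to, here is a minimal Python sketch of an allelic Fisher's exact test on one variant. It is a stand-in for illustration only, not PLINK's implementation and not necessarily the test used above; the function names are hypothetical, and the test treats the two alleles of each diploid sample as independent observations.

```python
from math import comb


def allele_counts(calls, allele):
    """Return (copies of `allele`, copies of any other allele) over diploid calls."""
    alleles = [a for gt in calls for a in gt.split("/")]
    hits = alleles.count(allele)
    return hits, len(alleles) - hits


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# var 2 from the toy example: treated are A/T, controls are A/A.
treated = ["A/T", "A/T", "A/T"]
controls = ["A/A", "A/A", "A/A"]
t_alt, t_other = allele_counts(treated, "T")   # 3 T alleles, 3 others
c_alt, c_other = allele_counts(controls, "T")  # 0 T alleles, 6 others
p_value = fisher_two_sided(t_alt, t_other, c_alt, c_other)
print(round(p_value, 4))  # 0.1818
```

With only three samples per group the toy variant is nowhere near significant, even though every treated sample differs from every control; a data set with 24 samples per group has far more power to separate such sites.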

19 months ago, pfs (USA/Boston) wrote:

If you care about comparing cases vs controls, then you still have two differences. The controls (c*) are homozygous for the major allele at var 2 while the treated samples (t*) are heterozygous, and the opposite holds for var 3. Whether an allele is the major (reference) or the minor allele does not by itself tell you whether it results in optimal or suboptimal gene expression. You should take the genotype calls as they are and annotate the variants with an annotation program to determine their effect.
