Question

vcf filtering and query

9.0 years ago

joaofadista ▴ 50

Hi,

I would like to know how to run the following query on a VCF file. I have 2 siblings for each of 500 mothers (1500 samples in total). All samples are genotyped genome-wide and stored in a VCF file. I want to create 3 VCF files from the original one:

One where the siblings' genotypes at each variant are kept only if their mother is homozygous reference. At sites where the mother is not homozygous reference, the genotypes of her children are set to missing (vcf1).
One where the siblings' genotypes at each variant are kept only if their mother is homozygous alternative. At sites where the mother is not homozygous alternative, the genotypes of her children are set to missing (vcf2).
One where the siblings' genotypes at each variant are kept only if their mother is heterozygous. At sites where the mother is not heterozygous, the genotypes of her children are set to missing (vcf3).

Example of desired vcf1:

       Sa1  Sa2  Ma   Sb1  Sb2  Mb
rs001  0/1  0/1  0/0  ./.  ./.  0/1
rs002  ./.  ./.  0/1  0/0  0/1  0/0

S = sibling

M = mother

Thanks in advance,
João

vcf filter query • 2.9k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.0 years ago by joaofadista ▴ 50

All 1.5k samples are in one file?

ADD REPLY • link 9.0 years ago by Dan ▴ 540

If all your samples are in the same VCF file, you may need to write your own script for this job.
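A script of this kind could look roughly like the sketch below. It is only an illustration under several assumptions: the VCF is uncompressed text, genotypes are diploid with GT as the first FORMAT sub-field, and the mother-to-children map (`FAMILIES`) is built by hand or from a pedigree file. The names `FAMILIES`, `classify`, `mask_gt` and `filter_vcf` are made up for this example and do not come from any library.

```python
import io

# Hypothetical family map: mother's sample name -> her children's sample names.
FAMILIES = {"Ma": ["Sa1", "Sa2"], "Mb": ["Sb1", "Sb2"]}

def classify(field):
    """Classify a sample field such as '0/1:35' as 'hom_ref', 'het' or 'hom_alt'.

    Returns None for missing or non-diploid genotypes. Assumes GT comes first.
    """
    alleles = field.split(":")[0].replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] == alleles[1]:
        # Any identical non-reference pair (1/1, 2/2, ...) counts as hom_alt.
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"

def mask_gt(field):
    """Set the GT sub-field to missing and leave the other sub-fields as they are."""
    parts = field.split(":")
    parts[0] = "./."
    return ":".join(parts)

def filter_vcf(lines, families, keep_class):
    """Yield VCF lines where children are masked unless the mother is keep_class."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            index = {name: i for i, name in enumerate(cols)}
            yield line
            continue
        for mother, children in families.items():
            if classify(cols[index[mother]]) != keep_class:
                for child in children:
                    cols[index[child]] = mask_gt(cols[index[child]])
        yield "\t".join(cols)

# Toy input with made-up genotypes for two families.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "Sa1", "Sa2", "Ma", "Sb1", "Sb2", "Mb"]
demo = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(header),
    "\t".join(["1", "100", "rs001", "A", "G", ".", ".", ".", "GT",
               "0/1", "0/1", "0/0", "0/1", "0/0", "0/1"]),
    "\t".join(["1", "200", "rs002", "C", "T", ".", ".", ".", "GT",
               "0/1", "1/1", "0/1", "0/0", "0/1", "0/0"]),
])

for out in filter_vcf(io.StringIO(demo), FAMILIES, "hom_ref"):
    print(out)
```

Calling `filter_vcf` with `"hom_ref"`, `"hom_alt"` and `"het"` in turn would give the three requested files. In this sketch a child is also masked when the mother's genotype is missing, which may or may not be what you want.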

What do you want to achieve by separating the files? Maybe there are better ways to do what you want to do without generating all those files?

ADD REPLY • link 9.0 years ago by Sam ★ 4.8k

Thanks for the replies. Yes, all 1.5k samples are in one VCF file. I will try to write a script then...

Let me explain my problem. For each of the 500 mothers I have 2 siblings, one affected and one not affected by a disease trait (they may be full or only half sibs). Since I have genome-wide data on both siblings and their mothers, I would like to find mother-child genotype pairs that affect the child's risk of developing this disease. Initially I thought of doing a GWAS on the siblings (correcting for relatedness), stratified by the genotype of the mothers. Would you recommend a better way to do this?

Thanks in advance.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.0 years ago by joaofadista ▴ 50

If you have the pedigree file, you can try plinkseq or kggseq to identify de novo mutations in your disease samples. Tools designed to detect somatic mutations in cancer might also be a good direction to look at: the goal is similar, only the samples are slightly different.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 9.0 years ago by Sam ★ 4.8k