vcf filtering and query
6.7 years ago

Hi,

I would like to know how to run the following query on a VCF file. I have 500 mothers, each with 2 children who are siblings (1500 samples in total). All samples are genotyped genome-wide and stored in a single VCF file. From the original file I want to create 3 VCF files:

• One where the siblings' genotype at a variant is kept only if their mother is homozygous reference at that site. Where the mother is not homozygous reference, the genotype of her siblings is set to missing (vcf1)
• One where the siblings' genotype at a variant is kept only if their mother is homozygous alternative at that site. Where the mother is not homozygous alternative, the genotype of her siblings is set to missing (vcf2)
• One where the siblings' genotype at a variant is kept only if their mother is heterozygous at that site. Where the mother is not heterozygous, the genotype of her siblings is set to missing (vcf3)

Example of desired vcf1:

       Sa1  Sa2  Ma   Sb1  Sb2  Mb
rs001  0/1  0/1  0/0  ./.  ./.  0/1
rs002  ./.  ./.  0/1  0/0  0/1  0/0


S = sibling

M = mother

João

vcf filter query • 2.2k views

Are all 1.5k samples in one file?


If all your samples are in the same VCF file, you will probably need to write your own script for this job.
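Such a script could look roughly like the Python sketch below. It is only a starting point and makes several assumptions: the VCF is uncompressed and tab-delimited, GT is the first FORMAT subfield, and you can build a mapping from each mother's sample ID to her children's sample IDs (for example from a pedigree file). The names `classify`, `mask_vcf` and `families` are made up for this illustration. A genotype with two different alleles (including 1/2) is treated as heterozygous, and a missing or non-diploid maternal genotype matches no class, so the siblings are masked in that case.

```python
# Genotype classes a mother can fall into at a site.
HOM_REF, HET, HOM_ALT = "hom_ref", "het", "hom_alt"


def classify(gt):
    """Classify a diploid GT string such as '0/1' or '1|1'.

    Returns None for missing or non-diploid genotypes.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    if alleles[0] != alleles[1]:
        return HET
    return HOM_REF if alleles[0] == "0" else HOM_ALT


def mask_vcf(lines, families, keep_class):
    """Yield VCF lines where siblings keep their genotype only if their
    mother's genotype class at that site equals keep_class.

    families: dict mapping mother sample ID -> list of sibling sample IDs
              (hypothetical structure, e.g. built from a pedigree file).
    """
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Map sample names to column indices once, from the header line.
            col = {name: i for i, name in enumerate(fields)}
            groups = [(col[m], [col[s] for s in sibs])
                      for m, sibs in families.items()]
            yield line
            continue
        for mother_idx, sib_idxs in groups:
            mother_gt = fields[mother_idx].split(":")[0]
            if classify(mother_gt) != keep_class:
                for s in sib_idxs:
                    sub = fields[s].split(":")
                    sub[0] = "./."  # mask GT only, keep other subfields
                    fields[s] = ":".join(sub)
        yield "\t".join(fields)
```

To produce the three files, you would open the input once per class and write out `mask_vcf(vcf_file, families, HOM_REF)`, then the same with `HOM_ALT` and `HET`. The mothers' own columns are left untouched, as in the example in the question.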

What do you want to achieve by separating the files? Maybe there are better ways to do what you want to do without generating all those files?


Thanks for the replies. Yes, all 1.5k samples are in one VCF file. I will try to write a script then...

Let me explain my problem. For each of the 500 mothers I have 2 siblings, one affected and one not affected by a disease trait (they may be full or only half sibs). Since I have genome-wide data on both siblings and their mothers, I would like to find mother-child genotype pairs that affect a child's risk of developing this disease. Initially I thought of running a GWAS on the siblings (correcting for relatedness), stratified by the genotype of the mothers. Would you recommend a better way to do this?


If you have the pedigree file, you can try plinkseq or kggseq to identify de novo mutations in your disease samples. Tools designed to detect somatic mutations in cancer might also be a good direction to look at: the goal is similar, only the samples are slightly different.

6.7 years ago
vassialk ▴ 200

Nextgene software makes it easy to work with VCFs and reports, or you could try VCFMiner and VCFtools. Ugene has a better viewer, though many of its other functions are weak.