Question

Statistical analysis of a whole exome VCF file

6.5 years ago

adam.waring • 0

I have a VCF file for a whole-exome sequencing dataset generated with the Agilent 1.1 capture kit.

The genome coordinates are GRCh37.

If I wanted to run a case-control burden test on every gene in the dataset, what steps would I need to follow?

How do I get a complete and unique list of genes to run the test on?
How do I subset just the variants in the exons of these genes?


Updated 6.5 years ago by WouterDeCoster • written 6.5 years ago by adam.waring

score 0 · Answer 1 · 2017-10-10

Below is some code I used a while ago to do something similar, though not at whole-exome scale.

# Annotate with snpEff ($SNPEFF = path to snpEff.jar, $SNPEFF_config = path to snpEff.config)
java -Xmx4g -jar "$SNPEFF" -c "$SNPEFF_config" GRCh37.75 yourfile.vcf > yourfile_annotated.vcf

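# Pull the gene name (4th '|'-separated subfield of each ANN entry), de-duplicate,
# then write one VCF per gene: the header plus every record that mentions that gene.
vcftools --vcf yourfile_annotated.vcf --get-INFO ANN --stdout | \
  cut -f5 | tr ',' '\n' | cut -f4 -d'|' | grep -v -w 'ANN' | sort -u | \
  parallel -j6 'cat <(grep "^#" yourfile_annotated.vcf) <(grep -v "^#" yourfile_annotated.vcf | grep -w {}) > {}.split.vcf'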
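
One caveat with the grep -w step: it matches the gene name anywhere on the line, and "-" counts as a word boundary, so a search for A1BG also picks up records that are only annotated with A1BG-AS1. Here is a minimal sketch of an exact-match alternative in awk. It assumes snpEff's ANN layout (gene name in the 4th "|"-separated subfield) and runs on a toy VCF whose positions and annotations are made up for illustration. The temporary paths are placeholders, and the {gene}.split.vcf naming just follows the command above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy annotated VCF (hypothetical records) standing in for yourfile_annotated.vcf.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '19\t1000\t.\tC\tT\t.\tPASS\tANN=T|missense_variant|MODERATE|A1BG|ID1\n'
  printf '19\t2000\t.\tG\tA\t.\tPASS\tANN=A|intron_variant|MODIFIER|A1BG-AS1|ID2\n'
  printf '17\t3000\t.\tG\tA\t.\tPASS\tDP=10;ANN=A|stop_gained|HIGH|BRCA1|ID3,A|upstream_gene_variant|MODIFIER|NBR2|ID4\n'
} > "$vcf"

# Fresh, empty output directory: the awk script below only appends to files.
outdir=$(mktemp -d)

awk -v outdir="$outdir" -F'\t' '
  /^#/ { header = header $0 "\n"; next }        # collect all header lines
  {
    split("", seen)                             # reset per-record gene set
    n = split($8, info, ";")                    # INFO column -> key=value items
    for (i = 1; i <= n; i++) {
      if (info[i] !~ /^ANN=/) continue
      m = split(substr(info[i], 5), ann, ",")   # one ANN entry per annotation
      for (j = 1; j <= m; j++) {
        split(ann[j], f, "[|]")
        gene = f[4]                             # Gene_Name subfield
        if (gene == "" || (gene in seen)) continue
        seen[gene] = 1
        out = outdir "/" gene ".split.vcf"
        if (!(gene in started)) { printf "%s", header >> out; started[gene] = 1 }
        print $0 >> out
        close(out)                              # avoid hitting the open-file limit
      }
    }
  }
' "$vcf"

# The unique gene list is the set of files that were written.
ls "$outdir" | sed 's/\.split\.vcf$//'
```

This reads the VCF once instead of once per gene, which matters at exome scale. A record annotated with several genes is written to each of their files.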

I hope this points you in the right direction.