Statistical analysis of a whole exome VCF file
6.5 years ago

I have a VCF file for a whole-exome sequencing dataset generated with the Agilent 1.1 capture kit.

The genome coordinates are GRCh37.

If I wanted to run a case-control burden test on every gene in the dataset, what steps would I need to follow?

  • How do I get a complete and unique list of genes to run the test on?
  • How do I subset just the variants in the exons of these genes?
gene VCF whole-exome • 1.9k views
6.5 years ago

Below is some code I used a while ago to do something similar, but not on a whole exome scale.

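# Annotate the variants with SnpEff using the GRCh37.75 database
java -Xmx4g -jar $SNPEFF -c $SNPEFF_config GRCh37.75 yourfile.vcf > yourfile_annotated.vcf

# Extract the unique gene names (4th subfield of ANN), then write one VCF per gene.
# Note: grep -w matches the gene name anywhere on the line, not only in ANN.
vcftools --vcf yourfile_annotated.vcf --get-INFO ANN --stdout | \
  cut -f5 | tr ',' '\n' | cut -f4 -d'|' | grep -v -w 'ANN' | sort -u | \
  parallel -j6 'cat <(grep "^#" yourfile_annotated.vcf) <(grep -v "^#" yourfile_annotated.vcf | grep -w {}) > {}.split.vcf'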
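
If vcftools is not at hand, the gene-listing step can be approximated with standard Unix tools only. This is a minimal sketch, assuming the VCF was annotated by SnpEff so that each ANN entry carries the gene name as its fourth |-separated subfield. The function name list_genes is a hypothetical helper introduced here for illustration, not part of any tool.

```shell
# Hypothetical helper: print the unique gene names found in the ANN field
# of a SnpEff-annotated VCF (assumes Gene_Name is the 4th ANN subfield).
# Steps: drop header lines, take the INFO column, split it into one key per
# line, keep the ANN entry, split it into one annotation per line, take the
# gene name, skip empty names, and de-duplicate.
list_genes() {
  grep -v '^#' "$1" \
    | cut -f8 \
    | tr ';' '\n' \
    | grep '^ANN=' \
    | sed 's/^ANN=//' \
    | tr ',' '\n' \
    | cut -d'|' -f4 \
    | grep -v '^$' \
    | sort -u
}
```

Something like `list_genes yourfile_annotated.vcf > genes.txt` would then give one gene name per line, which can be fed to the per-gene split step.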

I hope this points you in the right direction.
