Question: Statistical analysis of a whole exome VCF file
3.0 years ago, adam.waring wrote:

I have a VCF file for a whole-exome sequencing dataset generated with the Agilent 1.1 capture kit.

The genome coordinates are GRCh37.

If I wanted to run a case-control burden test on every gene in the dataset, what steps would I need to follow?

  • how do I get a complete and unique list of genes to run the test on?
  • how do I subset just the variants in the exons of these genes?
3.0 years ago, WouterDeCoster (Belgium) wrote:

Below is some code I used a while ago to do something similar, but not on a whole exome scale.

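# Annotate the variants with SnpEff (GRCh37.75 database); $SNPEFF is the SnpEff jar, $SNPEFF_config its config file
java -Xmx4g -jar "$SNPEFF" -c "$SNPEFF_config" GRCh37.75 yourfile.vcf > yourfile_annotated.vcf

# List the unique gene names from the ANN field (4th '|'-separated subfield),
# then write one VCF per gene: the header plus every record that mentions that gene
vcftools --vcf yourfile_annotated.vcf --get-INFO ANN --stdout | \
  cut -f5 | tr ',' '\n' | cut -f4 -d'|' | grep -v -w 'ANN' | sort -u | \
  parallel -j6 'cat <(grep "^#" yourfile_annotated.vcf) <(grep -v "^#" yourfile_annotated.vcf | grep -w {}) > {}.split.vcf'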
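
To make the gene-extraction step of the pipeline above easier to follow, here is a minimal sketch that does the same thing with plain coreutils on a made-up two-record VCF. It assumes the SnpEff ANN layout in which the gene name is the fourth '|'-separated subfield. The file toy.vcf, the gene names GENE1/GENE2 and the records are hypothetical. Matching on `|GENE|` instead of `grep -w` is a small variation of mine, meant to reduce hits on the same word elsewhere in the line; it is not part of the original command.

```shell
# Work in a scratch directory so nothing is written next to real data.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical toy VCF: two records with SnpEff-style ANN annotations.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\tANN=G|missense_variant|MODERATE|GENE1|ID1,G|upstream_gene_variant|MODIFIER|GENE2|ID2\n'
  printf '2\t200\t.\tC\tT\t.\tPASS\tANN=T|synonymous_variant|LOW|GENE1|ID1\n'
} > toy.vcf

# INFO is column 8; split it into keys, keep ANN, split ANN into its
# comma-separated annotations, and take subfield 4 (the gene name).
genes=$(grep -v '^#' toy.vcf \
  | cut -f8 \
  | tr ';' '\n' \
  | grep '^ANN=' \
  | sed 's/^ANN=//' \
  | tr ',' '\n' \
  | cut -d'|' -f4 \
  | sort -u)
printf '%s\n' "$genes"   # prints GENE1 and GENE2, one per line

# One VCF per gene: header lines plus every record annotated with that gene.
for g in $genes; do
  { grep '^#' toy.vcf; grep -v '^#' toy.vcf | grep -F "|$g|"; } > "$g.split.vcf"
done
```

A record annotated with several genes (like the first one here) ends up in several per-gene files, which is what a per-gene burden test usually needs, but it is worth checking against how your test counts variants.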

I hope this can put you in the right direction.
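
For the second part of the question (keeping only variants that fall inside exons), one possible approach is to filter the VCF against a BED file of exon intervals. Dedicated tools such as bedtools or bcftools are the usual choice for this. The sketch below uses plain awk instead, purely to show the coordinate logic. The files exons.bed and toy.vcf and their contents are hypothetical, and it assumes BED intervals are 0-based and half-open while VCF positions are 1-based.

```shell
# Work in a scratch directory so nothing is written next to real data.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical exon intervals in BED format (0-based start, half-open end).
printf '1\t99\t150\tGENE1_exon1\n2\t500\t600\tGENE3_exon1\n' > exons.bed

# Hypothetical VCF fragment (POS is 1-based).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf '1\t151\t.\tC\tT\t.\tPASS\t.\n'
  printf '2\t550\t.\tG\tA\t.\tPASS\t.\n'
} > toy.vcf

# First file (BED): remember every interval.
# Second file (VCF): pass header lines through, and keep a record only if
# its position lies inside an interval on the same chromosome.
awk -F'\t' '
  NR == FNR { chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0; n = NR; next }
  /^#/      { print; next }
  {
    pos = $2 + 0
    for (i = 1; i <= n; i++)
      if ($1 == chrom[i] && pos > start[i] && pos <= stop[i]) { print; next }
  }
' exons.bed toy.vcf > toy.exonic.vcf

grep -v '^#' toy.exonic.vcf | cut -f1,2
```

The loop checks every interval for every variant, which is fine for a toy example but slow on a real exome; it also assumes the chromosome names in the BED file and the VCF use the same convention (for example both "1" rather than "chr1").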
