Question

How to annotate a VCF with Entrez Gene IDs

8.9 years ago

Ward Weistra ▴ 220

Dear Biostars,

I would like to annotate my VCF with Entrez Gene IDs. I have found ways to add the HGNC gene symbol and the Ensembl gene ID (VEP, Annovar), but not the Entrez Gene ID directly. I would prefer not to translate from HGNC or Ensembl to Entrez, because I'm afraid information will be lost in that extra step.

A BED file with all Entrez Gene IDs might help, since I've found tools in Galaxy that annotate VCF files from BED files. Maybe I'm just using the wrong term for Entrez Gene IDs: I mean, for example, the 7157 in http://www.ncbi.nlm.nih.gov/gene/7157.

Thanks in advance,
Ward

ncbi vcf entrez • 4.5k views

Updated 15 months ago by Ram 43k • written 8.9 years ago by Ward Weistra ▴ 220

@wardweistra sorry to comment with a question. I spent most of the day reading up on variant calling and on how to get causative gene(s) from VCF files. By causative gene I mean the gene that causes a particular phenotype. At this stage I'm just trying to understand the lingo. So by annotating a VCF, do you mean that every SNP (variant) is assigned to a gene (or other feature)? If so, is the result a new file, or can the annotation be held in the VCF file itself? Any help is much appreciated.

P.S. I don't know how helpful this might be, but the tool SnpSift seems to do annotation: http://snpeff.sourceforge.net/SnpSift.html

Updated 15 months ago by Ram 43k • written 8.9 years ago by Kirill Tsyganov ▴ 370

8.9 years ago

Pierre Lindenbaum 161k

I wrote a tool to annotate a VCF from another indexed VCF: https://github.com/lindenb/jvarkit/wiki/VcfPeekVcf

For example, to annotate a 1000 Genomes VCF with the dbSNP VCF from NCBI ( http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/ ), we peek the INFO field named GENEINFO (and add an NCBI_VCF_ prefix):

$  curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" | \
  gunzip -c | \
  java -jar dist/vcfpeekvcf.jar -f ncbi/snp/organisms/human_9606/VCF/00-All.vcf.gz -t GENEINFO -p NCBI_VCF_ | \
  cut -f 1-8 | grep NCBI_VCF_GENEINFO | head

##INFO=<ID=NCBI_VCF_GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
22    16260678    rs5746333    G    A    100    PASS    AA=G|||;AC=3244;AF=0.647764;AFR_AF=0.3888;AMR_AF=0.5634;AN=5008;DP=8520;EAS_AF=0.9673;EUR_AF=0.6133;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.7638;VT=SNP
22    16264717    rs148113506    TA    T    100    PASS    AA=A|A|-|deletion;AC=2066;AF=0.41254;AFR_AF=0.3858;AMR_AF=0.4265;AN=5008;DP=53564;EAS_AF=0.4196;EUR_AF=0.4274;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.4162;VT=INDEL
22    16265110    rs2212121    C    T    100    PASS    AA=C|||;AC=416;AF=0.0830671;AFR_AF=0.0045;AMR_AF=0.1744;AN=5008;DP=22443;EAS_AF=0.1667;EUR_AF=0.0219;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.1012;VT=SNP
22    16267558    rs2010682    T    C    100    PASS    AA=C|||;AC=4111;AF=0.820887;AFR_AF=0.8434;AMR_AF=0.6758;AN=5008;DP=10404;EAS_AF=0.9762;EUR_AF=0.7097;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.8476;VT=SNP
22    16269466    rs2212127    T    C    100    PASS    AA=C|||;AC=3668;AF=0.732428;AFR_AF=0.6641;AMR_AF=0.6066;AN=5008;DP=2535;EAS_AF=0.9712;EUR_AF=0.6262;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.7771;VT=SNP
22    16269829    rs114833654    T    A    100    PASS    AA=A|||;AC=4085;AF=0.815695;AFR_AF=0.7186;AMR_AF=0.768;AN=5008;DP=7907;EAS_AF=0.9484;EUR_AF=0.7992;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.8609;VT=SNP
22    16277622    rs2845217    G    A    100    PASS    AA=A|||;AC=2911;AF=0.58127;AFR_AF=0.3298;AMR_AF=0.5216;AN=5008;DP=5436;EAS_AF=0.9167;EUR_AF=0.5467;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0.6534;VT=SNP
22    16285169    rs192723103    T    G    100    PASS    AA=T|||;AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0.0014;AN=5008;DP=23204;EAS_AF=0;EUR_AF=0;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0;VT=SNP
22    16285178    rs184299536    G    C    100    PASS    AA=G|||;AC=1;AF=0.000199681;AFR_AF=0.0008;AMR_AF=0;AN=5008;DP=23166;EAS_AF=0;EUR_AF=0;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;SAS_AF=0;VT=SNP
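
Once the GENEINFO value is in the INFO column, the Entrez Gene ID can be pulled out with a few lines of scripting. Below is a minimal Python sketch, not part of jvarkit; the function name `parse_geneinfo` is made up for illustration, and it assumes the `symbol:id` pairs separated by `|` that the header line above describes.

```python
def parse_geneinfo(info_field, key="NCBI_VCF_GENEINFO"):
    """Return a list of (gene_symbol, entrez_gene_id) tuples from a VCF INFO string.

    Assumes the value looks like 'SYMBOL:ID' or 'SYMBOL1:ID1|SYMBOL2:ID2',
    as described in the ##INFO header for GENEINFO.
    """
    # Split INFO into KEY=VALUE entries first, so that '|' characters in
    # other fields (e.g. AA=G|||) are never mistaken for pair delimiters.
    for entry in info_field.split(";"):
        name, _, value = entry.partition("=")
        if name == key and value:
            pairs = []
            for pair in value.split("|"):
                # Split on the last ':' to separate the symbol from the numeric ID.
                symbol, _, gene_id = pair.rpartition(":")
                pairs.append((symbol, int(gene_id)))
            return pairs
    return []  # the key is absent from this record


if __name__ == "__main__":
    info = "AA=G|||;AC=3244;NCBI_VCF_GENEINFO=POTEH:23784;NS=2504;VT=SNP"
    print(parse_geneinfo(info))
```

Splitting on `;` before looking at `|` matters here, because the 1000 Genomes `AA` field also contains vertical bars.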

See also: snpsift http://snpeff.sourceforge.net/SnpSift.html#annotate

Updated 15 months ago by Ram 43k • written 8.9 years ago by Pierre Lindenbaum 161k

Ram · Accepted Answer · 2015-05-29

If you use the RefSeq transcript set with VEP, you get the Entrez gene IDs in the Gene column of the output:

echo "17 7673573 . T C" | perl variant_effect_predictor.pl -refseq -database -force -o stdout -fields Gene,Feature,Consequence | grep -v "##"
#Gene   Feature Consequence
7157    NM_001126115.1  missense_variant
7157    NM_001276696.1  missense_variant
7157    NM_001276697.1  missense_variant
7157    NM_001126113.2  missense_variant
7157    NM_001126118.1  missense_variant
7157    NM_001276699.1  missense_variant
7157    NM_000546.5     missense_variant
7157    NM_001276760.1  missense_variant
7157    NM_001126114.2  missense_variant
7157    NM_001276695.1  missense_variant
7157    NM_001276761.1  missense_variant
7157    NM_001126117.1  missense_variant
7157    NM_001276698.1  missense_variant
7157    NM_001126112.2  missense_variant
7157    NM_001126116.1  missense_variant

You can do the same in the VEP web interface by selecting the relevant transcript set when you submit your job.
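
If you then need the Entrez Gene IDs per transcript in a script, the tabular output shown above is easy to post-process. This is only a rough Python sketch under a few assumptions: the columns are whitespace-separated, the header is the single line starting with one `#`, and the `Gene` and `Feature` columns were requested as in the command above. The function name `transcripts_by_gene` is hypothetical and not part of VEP.

```python
from collections import defaultdict


def transcripts_by_gene(lines):
    """Group transcript IDs (Feature column) by gene ID (Gene column)."""
    header = None
    grouped = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' meta lines
        if line.startswith("#"):
            # Column header, e.g. '#Gene   Feature Consequence'
            header = line.lstrip("#").split()
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        row = dict(zip(header, line.split()))
        grouped[row["Gene"]].append(row["Feature"])
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "#Gene   Feature Consequence",
        "7157    NM_001126115.1  missense_variant",
        "7157    NM_000546.5     missense_variant",
    ]
    print(transcripts_by_gene(sample))
```

The gene IDs are kept as strings on purpose, since the Gene column holds Ensembl identifiers instead when the RefSeq transcript set is not selected.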