Vcf File --> Gene/Protein Mutation Information
Entering edit mode
8.4 years ago
newDNASeqer ▴ 730

I used AnnoVar to annotate the VCF files, and it seems to me that the VCF files contain mutation site that only has chromosomal loci/position. I was wondering if there's any tool to transform these mutation information to gene/protein that has the mutated amino acid. For example, I want the final results to show mutations with the corresponding gene names (e.g. p53: Thr 200 --> Glu mutation)

What program can do this with the annotated vcf? THANKS a lot

vcf mutation annovar • 10k views
Entering edit mode
8.4 years ago

If you have a sorted BED file containing genes and their coordinates, you can use BEDOPS bedmap to map positions in a VCF file to those genes.

The overall process is:

  1. Convert the VCF file to BED format with the vcf2bed script
  2. Get genes from somewhere and put them into BED format
  3. Map the converted file to the genes


$ vcf2bed < positions.vcf > positions.bed

Obtaining Genes

Now grab some genes and turn them into a BED file. Let's say you are working with mouse (build mm10). We'll take Ensembl gene annotations:

$ wget -O - \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > mm10.genes.bed

You'd modify this step as needed for whichever organism and build you are working with. We use gtf2bed here to convert the GTF-formatted Ensembl records to sorted BED, and we then filter the output specifically for genes.


We have two sorted, BED-formatted inputs: the VCF-sourced positions and our gene annotations. We're now ready to use bedmap to do mapping:

$ bedmap --echo --echo-map-id positions.bed mm10.genes.bed > answer.bed

The file answer.bed is a sorted BED file that contains the mutation positions and a semi-colon-delimited list of gene IDs which overlap the position:

[ mutation-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ mutation-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
[ mutation-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]

Where no genes overlap a mutation, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.

If you want the entire gene record — and not just the ID — use --echo-map in place of --echo-map-id. Other mapping --echo-* options are available, depending on what you want bedmap to report.

Entering edit mode

I keep running into your answers now that I'm working with variants again... hello! :D

Entering edit mode

Hey, congrats on the new job! Hope you're well.

Entering edit mode
8.4 years ago

Please take another look at the on-line Annovar documentation, especially regarding gene-based annotation. Also the quick startup guide. Annovar certainly can output cDNA position info and the amino acid substitution. You just have to ask for it. What command did you run to get your output?

Entering edit mode
8.4 years ago

You can also give data in vcf format to ensembl's Variant Effect Predictor, if the organism is in their database.

You can find that here:

It's a bit of a pain to parse; since each gene has multiple transcripts, it will output the same SNP multiple times, to account for its effect in every relevant transcript.

Entering edit mode
8.4 years ago
GPR ▴ 350

A while ago I was recommended the BEDOPs tools, which coupled to a RefFlat-Genes table, will append a gene name per chromosome location.


Login before adding your answer.

Traffic: 2641 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6