Question

How to obtain genomic coordinates of point mutations given in protein/amino acid notation

0

6.0 years ago

Friederike 8.9k

Hi,

I have a list of mutations using the protein sequence as a reference, e.g. JAK1 G1097D. I would like to obtain the genome coordinates of the actual base pairs encoding the affected codon. Ideally, all I have to do is supply the gene identifier (including the organism) and the mutation and in return I get the genome coordinates.

I can think up a couple of ways to address this problem, but I'm sure there must be a solution out there already -- my web searching has failed me so far, so please do share your bookmarks!

Thanks!

sequence proteins genes mutations • 3.2k views

2

6.0 years ago

Friederike 8.9k

Just for future searches: I finally found this old Biostars post that addressed exactly my problem. The solution that worked best for me was TransVar. It also helped me figure out which input format it needed, because it returned a reasonably informative error message whenever it didn't find anything.
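As a rough illustration of the TransVar route, the sketch below assembles a protein-level query from Python. It assumes TransVar is installed with a reference genome and the Ensembl annotation configured, and that the `panno` subcommand with the `-i` and `--ensembl` options is the appropriate entry point for this kind of input; the helper name `transvar_panno_command` is hypothetical.

```python
import shlex


def transvar_panno_command(gene, mutation):
    """Build the argument list for a TransVar protein-level query.

    gene:     gene symbol, e.g. "JAK1"
    mutation: single-letter amino acid change, e.g. "G1097D"
    """
    # TransVar identifiers take the form GENE:p.<ref><position><alt>
    identifier = f"{gene}:p.{mutation}"
    # Assumption: the Ensembl annotation was set up beforehand
    return ["transvar", "panno", "-i", identifier, "--ensembl"]


cmd = transvar_panno_command("JAK1", "G1097D")
# Print the command as it would be typed in a shell
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` once TransVar is available on the `PATH`.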

score 2 · Accepted Answer · 2018-04-30

2

6.0 years ago

Emily 23k

You can put them into the Ensembl VEP. It accepts HGVS notation as input and returns the genome coordinates (amongst other things) in the output.
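Besides the web interface, the same lookup can probably be scripted against the Ensembl REST API. The sketch below assumes the `vep/:species/hgvs/:hgvs_notation` endpoint on `https://rest.ensembl.org` and that it accepts protein-level HGVS strings in the same form as the web tool; the function names `vep_hgvs_url` and `fetch_vep` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def vep_hgvs_url(hgvs, species="human"):
    """Return the REST URL for a VEP query in HGVS notation."""
    # Keep ':' readable; characters such as '>' are percent-encoded
    encoded = urllib.parse.quote(hgvs, safe=":")
    return f"{REST_SERVER}/vep/{species}/hgvs/{encoded}"


def fetch_vep(hgvs, species="human"):
    """Query the (assumed) VEP HGVS endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        vep_hgvs_url(hgvs, species),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(vep_hgvs_url("JAK1:p.Gly1097Asp"))
```

Calling `fetch_vep("JAK1:p.Gly1097Asp")` needs network access; the returned JSON can then be inspected for the genomic location of the variant.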

0

Thanks for the hint! I seem to struggle with the input, though: none of the following have produced any results for the above-mentioned JAK1 mutation. I've tried all combinations of ENST* or ENSG* identifiers with the mutation written as 1097Gly>Asp, which seemed to me to be the notation used in the examples, as well as Gly1097Asp, which seemed to be the one outlined at the SVN page. Using the single-letter notation for the amino acids didn't work either.

ENST00000342505:p.1097Gly>Asp
ENSG00000162434:p.Gly1097Asp

I've tried this via the website using the default settings. Any additional hints?

6.0 years ago by Friederike 8.9k

1

HGVS notation for protein is :p.3-letter_amino_acid+position+3-letter_amino_acid, i.e. :p.Gly1097Asp. To use this you need a protein ID (e.g. NP or ENSP) or a protein name.

The following work:

JAK1:p.Gly1097Asp

ENSP00000343204:p.Gly1097Asp

A gene ID doesn't work at all, and with a transcript ID it expects CDS coordinates.
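For a whole list of mutations in the short form used in the question (e.g. G1097D), the conversion to this notation can be automated. The following is a minimal sketch that handles only simple substitutions between the 20 standard amino acids; the function name `to_hgvs_protein` is hypothetical.

```python
import re

# Single-letter to three-letter codes for the 20 standard amino acids
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_hgvs_protein(identifier, mutation):
    """Convert e.g. ("JAK1", "G1097D") to "JAK1:p.Gly1097Asp".

    identifier: protein name or protein ID (e.g. ENSP...), not a gene
                or transcript ID
    mutation:   <ref letter><position><alt letter>, substitutions only
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unsupported mutation format: {mutation!r}")
    ref, position, alt = match.groups()
    if ref not in THREE_LETTER or alt not in THREE_LETTER:
        raise ValueError(f"Unknown amino acid code in {mutation!r}")
    return f"{identifier}:p.{THREE_LETTER[ref]}{position}{THREE_LETTER[alt]}"


print(to_hgvs_protein("JAK1", "G1097D"))
print(to_hgvs_protein("ENSP00000343204", "G1097D"))
```

Stop codons, insertions, deletions and frameshifts are deliberately left out and raise a `ValueError`.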

6.0 years ago by Emily 23k

0

Of course, using an actual protein ID makes sense! :) Thanks for pointing that out!

6.0 years ago by Friederike 8.9k