Convert amino acid and nucleotide coding changes between notation formats
7.1 years ago
steve ★ 3.5k

I have a large number of annotated variant IDs (nucleotide coding changes and amino acid changes) in mixed formats, like this:

c.973T>C

c.215C>A

p.Gly12Asp

c.516_523delTGTGAGGC

c.G1514A

p.S505N

Some of them came from ANNOVAR and others came from legacy software; now they have all been mixed together.

I would like to be able to convert them all easily from one format to another. Is there a program that can do this? For reference, the nucleotide conversions would look like this:

c.973T>C <-> c.T973C

c.215C>A <-> c.C215A

c.1514G>A <-> c.G1514A

Likewise for the amino acid change IDs, though I am not actually sure how the deletions would be handled. Googling did not turn up anything, though I probably haven't figured out the right keywords.
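For the plain substitutions listed above, the mapping is purely textual, so a regular expression is enough to go in either direction. This is only a minimal sketch: the function names `legacy_to_hgvs` and `hgvs_to_legacy` are made up for illustration, and it assumes simple coding positions (no intronic offsets such as c.123+1G>A, no UTR positions, no deletions or insertions). Anything it does not recognise is returned as `None` rather than guessed.

```python
import re

# ANNOVAR-like substitution: ref base, position, alt base (e.g. c.T973C)
LEGACY_SUB = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")
# HGVS-like substitution: position, ref base, '>', alt base (e.g. c.973T>C)
HGVS_SUB = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def legacy_to_hgvs(code):
    """Rewrite c.T973C as c.973T>C; return None for anything else."""
    m = LEGACY_SUB.match(code)
    if m is None:
        return None
    ref, pos, alt = m.groups()
    return f"c.{pos}{ref}>{alt}"


def hgvs_to_legacy(code):
    """Rewrite c.973T>C as c.T973C; return None for anything else."""
    m = HGVS_SUB.match(code)
    if m is None:
        return None
    pos, ref, alt = m.groups()
    return f"c.{ref}{pos}{alt}"
```

Returning `None` for unrecognised codes makes it easy to collect the IDs that still need manual attention instead of silently passing them through.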

variant • 2.6k views

Once you have them standardised, what would the next step be? These are the HGVS notations. Do you know which genes they map to? If you have something like 5:g.140532T>C or NM_153681.2:c.7C>T or ENST00000285667.3:c.1047_1048insC or NP_000020.1:p.Met268Thr, you can use the Variant Effect Predictor to get them all either as genomic coordinates or known IDs such as rsXXXXX.


I don't think it's necessary to use the Variant Effect Predictor. I believe the hgvs Python package is the somewhat official code for parsing HGVS (http://hgvs.readthedocs.io/).


This is a very good tool; however, to use it the mutations have to be in the HGVS mutnomen standard, and this might not be the case. Also, c.G1514A is not HGVS standard as far as I remember.


Yes, I already have all the associated metadata (genes, transcripts, etc.), but I'm trying to get all these IDs into a consistent format to standardize my dataset.

7.1 years ago

There is a standard HGVS notation for reporting mutations using transcript or protein sequences as a reference. Here is the old version, http://www.hgvs.org/mutnomen/, and here is the updated version, http://varnomen.hgvs.org/. Your example codes are in line with at least one of these standards. Only this one bothers me:

c.G1514A

I do not remember this style as a standard notation.

For everything else, it is possible to write a script that makes the codes look like standard HGVS, but it is very hard to guarantee that each description is actually correct HGVS notation, because HGVS has specific rules for repeat regions, splice-site mutations and so on. For example, suppose CTATATAG on the forward strand of DNA is changed to CTATAG. Which deletion do you report: the first TA, the second TA or the third TA? To the end user and in databases, c.23_24delTA and c.27_28delTA look like different mutations, yet the resulting sequence is identical. HGVS therefore fixes which position to report (the most 3' one in the reference sequence), and a purely textual conversion cannot check that. Because of this, there are two options to consider:

  1. Check whether your data include chromosome, position, reference allele and alternative allele for each mutation, together with the reference assembly used. If the assemblies differ, use the liftOver tool from UCSC to convert everything to a single assembly, then use a tool that generates HGVS notations from the reference transcripts in your metadata.
  2. Write a tool that makes your mutation codes look like proper HGVS notation (replace single-letter amino acid codes with three-letter ones, remove the bases listed after del, and so on).
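As a rough illustration of option 2, the two clean-ups mentioned there (three-letter amino acid codes, dropping the bases after del) could be sketched as below. The function name `tidy` and the two patterns are my own assumptions, not part of any existing tool; the sketch only touches simple single-residue protein substitutions and simple coding deletions, leaves stop codons and everything else unchanged, and does not check anything against a reference sequence.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# p.S505N style: one-letter ref, position, one-letter alt
ONE_LETTER_SUB = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")
# c.516_523delTGTGAGGC style: deletion followed by the deleted bases
DEL_WITH_BASES = re.compile(r"^(c\.\d+(?:_\d+)?del)[ACGT]+$")


def tidy(code):
    """Return an HGVS-looking version of code, or code unchanged if unrecognised."""
    m = ONE_LETTER_SUB.match(code)
    if m and m.group(1) in AA3 and m.group(3) in AA3:
        return f"p.{AA3[m.group(1)]}{m.group(2)}{AA3[m.group(3)]}"
    m = DEL_WITH_BASES.match(code)
    if m:
        # Keep only the position range and 'del'
        return m.group(1)
    return code
```

For example, `tidy("p.S505N")` gives `p.Ser505Asn` and `tidy("c.516_523delTGTGAGGC")` gives `c.516_523del`, while codes that are already in the target form pass through untouched.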

The second approach might be ok, but it is not guaranteed to give you proper HGVS notation for the reasons described above, so I would go with the first option.
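To make the repeat problem concrete, here is a small sketch using the CTATATAG example. It uses plain 1-based positions on a toy sequence (an assumption for illustration; real c. coordinates are relative to a transcript), and the helper names `delete` and `shift_deletion_3prime` are hypothetical. It shows that all three TA deletions give the same product, and how a deletion can be slid to the most 3' equivalent position, which is something a text-only conversion cannot do without the reference sequence.

```python
def delete(seq, start, end):
    """Delete the 1-based inclusive range start..end from seq."""
    return seq[:start - 1] + seq[end:]


def shift_deletion_3prime(seq, start, end):
    """Slide a deletion rightwards for as long as the product stays the same.

    Moving the deleted window one base to the right leaves the resulting
    sequence unchanged exactly when the first deleted base equals the base
    immediately after the window.
    """
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


ref = "CTATATAG"

# Deleting the first (2-3), second (4-5) or third (6-7) TA gives one product
products = {delete(ref, s, s + 1) for s in (2, 4, 6)}

# All three descriptions normalise to the same 3'-most position
normalised = {shift_deletion_3prime(ref, s, s + 1) for s in (2, 4, 6)}
```

Here `products` contains only `CTATAG`, and `normalised` contains only `(6, 7)`, so the three apparently different deletions collapse into one description once the reference sequence is taken into account.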

If you do not have the original chromosome, position, reference and alternative allele data for the mutations, you can try to guesstimate them, but this is not easy, and for protein-level mutations it may be impossible.

Moreover, your notation refers to particular versions of transcripts and proteins. If the data are old, some of the sequences have most likely been updated since, and in rare cases this changes the notation.
