Question

What are HGVS Names?

2

10.1 years ago

jcorroon ▴ 50

I am a clinician, and have no particular expertise in genomics. I'm confused by much of what I see on dbSNP. Any help would be greatly appreciated! Thank you in advance.

I need some help understanding HGVS Names!

I see SNPs referred to with conventions like A66G for MTRR (rs1801394) and A1298C for MTHFR (rs1801131).

When I look under HGVS names for rs1801394 I see: NM_002454.2:c.66A>G, which I assume equals "A66G", but there are many other "names" there. What are these names? Are they all equivalent?

When I look under HGVS names for rs1801131 I do not see anything resembling "A1298C". Does this mean I have the wrong rs# for this SNP?

According to SNPedia, it's the correct rs#: "rs1801131 is a SNP in the MTHFR gene, representing an A>C mutation at mRNA position 1298".

SNP • 13k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by jcorroon ▴ 50

0

Hi, I have a similar question. I have the coding HGVS C.5769delG and would like to turn it into a SNP ID/FASTA format. How can I do so?

ADD REPLY • link 7.7 years ago by marierose.mina • 0

0

It's c.5769delG, the first c should not be capitalized. It's a cDNA change, a G deletion at position 5769. What you have here is just a partial ID, by the way. You need to know the transcript this variant is referring to. See this comment for an example: C: What are HGVS Names?

Once you have the transcript sequence, use any tool to remove the 5769th base of its coding sequence (c. positions count from the A of the start codon, so skip any 5' UTR) after you verify it's a G. You can use the Python hgvs package to automate this.
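The manual step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `apply_single_base_deletion` and the toy sequence are hypothetical, the input string is assumed to start at the A of the start codon (so string position equals c. position), and it does not use the hgvs package's own API.

```python
def apply_single_base_deletion(cds, position, expected_base):
    """Delete the base at 1-based `position` of `cds`.

    Raises ValueError if the base found there is not `expected_base`,
    which usually means the wrong transcript or the wrong numbering.
    """
    if not 1 <= position <= len(cds):
        raise ValueError(f"position {position} is outside the sequence")
    observed = cds[position - 1]  # convert 1-based position to 0-based index
    if observed.upper() != expected_base.upper():
        raise ValueError(
            f"expected {expected_base} at position {position}, found {observed}"
        )
    return cds[:position - 1] + cds[position:]


# Toy coding sequence, NOT a real transcript; stands in for c.4delG.
toy_cds = "ATGGCCGTA"
print(apply_single_base_deletion(toy_cds, 4, "G"))
```

For a real variant such as c.5769delG you would pass the coding sequence of the correct transcript version and `position=5769`; the check on the expected base is what catches a mismatched reference.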

ADD REPLY • link 7.7 years ago by Ram 45k

Ram · Accepted Answer · 2015-06-19

5

10.1 years ago

Ram 45k

I'll try to explain briefly, but it's a lot and it's wonderful - one of the rare parts where we have great standards. Check it out: http://www.hgvs.org/mutnomen/

The usual format is:

<REFERENCE_SEQUENCE_ID>:<SEQUENCE_TYPE>.<POSITION><CHANGE>

NM_002454.2:c.66A>G is a cDNA change of adenine to guanine at position 66 in the cDNA reference sequence NM_002454, version 2.

A66G most commonly refers to an amino acid change of alanine to glycine at position 66 of some protein (ideally written with the protein identifier, as NP_xxxxxxx:p.A66G). In your case, though, I think you might be referring to a 66A>G nucleotide change as A66G, which is wrong.
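To make the ambiguity concrete, here is a small sketch that reads a shorthand like A66G as a protein change and also flags when both letters could equally be nucleotides. The function names are hypothetical and the code only handles the simple `<letter><number><letter>` pattern; it is not a validator for real HGVS names.

```python
import re

# Standard one-letter to three-letter amino acid codes.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
SHORTHAND_RE = re.compile(r"([A-Z])(\d+)([A-Z])")


def shorthand_to_protein(shorthand):
    """Read e.g. 'A66G' as a protein change and return 'p.Ala66Gly'."""
    m = SHORTHAND_RE.fullmatch(shorthand)
    if m is None or m.group(1) not in AA3 or m.group(3) not in AA3:
        raise ValueError(f"not a simple protein shorthand: {shorthand!r}")
    ref, pos, alt = m.groups()
    return f"p.{AA3[ref]}{pos}{AA3[alt]}"


def could_be_nucleotide_change(shorthand):
    """True when both letters are also valid DNA bases (A, C, G, T)."""
    m = SHORTHAND_RE.fullmatch(shorthand)
    return bool(m) and m.group(1) in "ACGT" and m.group(3) in "ACGT"


print(shorthand_to_protein("A66G"), could_be_nucleotide_change("A66G"))
```

Because A, C, G and T are all valid amino acid codes as well as bases, a bare "A66G" cannot tell the reader which meaning was intended; the `c.` or `p.` prefix is what removes the doubt.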

Reference sequences can be g. (genomic), c. (cDNA), r. (RNA) or p. (protein).

Changes can be single amino acid variants (p.Ala66Gly), single nucleotide variants (c.66A>G), deletions (c.10delA), insertions (c.10_11insT), duplications (c.10dupA), indels (c.10_11delACinsTGA), etc.

It is so well standardized that programs can parse and pick individual components for us to use.
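As a rough illustration of that parseability, the sketch below splits a name of the form `<REFERENCE_SEQUENCE_ID>:<SEQUENCE_TYPE>.<POSITION><CHANGE>` with a regular expression and guesses the kind of change from keywords. It is a deliberately simplified, hypothetical parser (`parse_hgvs` is not a library function) and covers only the cases listed above; real tools such as the Python hgvs package implement the full grammar.

```python
import re

# Simplified pattern: <reference>:<type>.<description>
HGVS_RE = re.compile(
    r"^(?P<ref>[A-Za-z]{2}_\d+(?:\.\d+)?):(?P<type>[gcrp])\.(?P<desc>.+)$"
)


def parse_hgvs(name):
    """Split a simple HGVS-style name into its components."""
    m = HGVS_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognised HGVS-style name: {name!r}")
    desc = m.group("desc")
    # Order matters: an indel contains both 'del' and 'ins'.
    if m.group("type") == "p":
        kind = "protein change"
    elif "del" in desc and "ins" in desc:
        kind = "indel"
    elif "del" in desc:
        kind = "deletion"
    elif "ins" in desc:
        kind = "insertion"
    elif "dup" in desc:
        kind = "duplication"
    elif ">" in desc:
        kind = "substitution"
    else:
        kind = "other"
    return {
        "reference": m.group("ref"),
        "type": m.group("type"),
        "description": desc,
        "kind": kind,
    }


print(parse_hgvs("NM_002454.2:c.66A>G"))
```

The same function accepts UTR positions such as `NM_024091.3:c.-1995T>C`, since everything after the `c.` is kept as an uninterpreted description string.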

ADD COMMENT • link 6.7 years ago by Ram 45k

0

Thanks for your help!

To confirm, the format "A66G" means something different than the format "66A>G"?

The A66G format represents an amino acid change, but the 66A>G format represents a base change?

What is the REFERENCE_SEQUENCE_ID?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by jcorroon ▴ 50

1

NM_002454.2:c.66A>G is a standardized coding HGVS identifier and is nucleotide-specific; the REFERENCE_SEQUENCE_ID is the transcript NM_002454.2.

66A>G alone could mean different things to different people.

ADD REPLY • link updated 6.7 years ago by Ram 45k • written 10.1 years ago by Jeremy Leipzig 23k

0

There are 29 different HGVS names for rs1801131, many with different prefixes (e.g. NM, XP, NG, NC, etc.).

Are all the HGVS names equivalent (i.e. describing the same SNP)?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by jcorroon ▴ 50

1

Each of those prefixes means a different type of reference sequence. For more information on accession number prefixes: http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

They do describe the same polymorphism, but on different reference sequences. They could be on different locations on different mRNAs (alternative splicing), and some of the mRNAs could be "predicted" (XM).
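A quick way to see what each name is anchored to is to look up the accession prefix. The sketch below hard-codes a subset of the RefSeq prefixes from the NCBI table linked above; the dictionary and the function name `describe_accession` are illustrative, not part of any library.

```python
# Subset of RefSeq accession prefixes and the molecule type they denote.
REFSEQ_PREFIXES = {
    "NC": "genomic: complete genomic molecule (e.g. a chromosome)",
    "NG": "genomic: incomplete genomic region (e.g. a gene region)",
    "AC": "genomic: complete genomic molecule, alternate assembly",
    "NM": "mRNA: curated protein-coding transcript",
    "NR": "RNA: curated non-protein-coding transcript",
    "XM": "mRNA: predicted (model) protein-coding transcript",
    "XR": "RNA: predicted (model) non-protein-coding transcript",
    "NP": "protein: curated, associated with an NM_ or NC_ accession",
    "XP": "protein: predicted (model), associated with an XM_ accession",
}


def describe_accession(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.split("_", 1)[0]
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


for acc in ["NM_002454.2", "NM_024010.2", "NM_024091.3"]:
    print(acc, "->", describe_accession(acc))
```

Running it over every HGVS name listed for one rs number groups them by molecule type, which makes it easier to see that they are the same polymorphism expressed against genomic, transcript and protein references.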

ADD REPLY • link 2.6 years ago by Ram 45k

2

10.1 years ago

Jeremy Leipzig 23k

The 1298 in A1298C must be relative to the mRNA, which includes the 5' UTR, or maybe it is just based on an earlier prediction of where the CDS began. I would imagine there is some historical reason it is still known by that name.

Here is one of many papers where they discuss c.665C>T and c.1286A>C: http://www.nature.com/gim/journal/v15/n2/full/gim2012165a.html

HGVS coding nomenclature starts from the CDS start, and is transcript specific, hence NM_005957.4:c.1286A>C is the HGVS identifier for this mutation. As long as you include the c-dot people will know what you mean.
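The relationship between mRNA numbering and c. numbering is a simple offset, sketched below. The function name is hypothetical, and the `cds_start=13` used in the example is an assumption chosen only to reproduce the 12-base difference between 1298 and 1286 mentioned in this thread; check the CDS annotation of the actual transcript record before relying on any such value.

```python
def mrna_to_coding_position(mrna_pos, cds_start):
    """Convert a 1-based mRNA position to an HGVS c. position.

    cds_start is the 1-based mRNA position of the A in the start codon.
    Positions in the 5' UTR come out negative; there is no c.0, so the
    base immediately upstream of the start codon is c.-1.
    """
    if mrna_pos >= cds_start:
        return mrna_pos - cds_start + 1
    return mrna_pos - cds_start


# Hypothetical CDS start: a 12-base 5' UTR puts the A of ATG at position 13.
print(mrna_to_coding_position(1298, cds_start=13))
```

If an older transcript version had a different UTR length or a different predicted start codon, the same mRNA position maps to a different c. position, which is why the reference sequence and its version belong in the name.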

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by Jeremy Leipzig 23k

1

Oh, the pain of the UTRs, and the sequence history! This numbering can be a nightmare - I spent a month bringing various variants up to speed on current references and tagging them to their source reference sequences.

ADD REPLY • link 10.1 years ago by Ram 45k

0

Thanks Jeremy. When identifying SNPs in this way in the future, should I always use the HGVS identifier with the "NM" prefix? Also, for the other example above (rs1801394) there are 3 HGVS names with "NM" prefixes. What would be the correct way of identifying this SNP?

Thanks SO much!

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by jcorroon ▴ 50

2

Ideally, you should. When you use NM_000123.2:, your naming is valid across space (all possible reference sequences for that gene) and time (all existing/future versions of that specific reference sequence). In other words, you will be really specific and this will help people that refer to your work in the future.

If you are working on the same reference sequence (say, across the entire project/manuscript), you could always mention the ID on top/in the Methods section and just go with c.66A>G everywhere else. This way, you might minimize typing effort.

ADD REPLY • link 2.6 years ago by Ram 45k

1

Thanks Ram. What is ideal if there are multiple reference sequence IDs for the same polymorphism? Include all of the reference sequence IDs in the Methods section, and then refer to all of the types, positions and changes elsewhere (i.e. c.66A>G, c.147A>G, c.-1995T>C)? When referring to a polymorphism, why is it ideal to use a reference sequence for a cDNA molecule (i.e. using the NM accession prefix to identify the reference sequence) when you could use one for a genomic DNA molecule (i.e. using AC or NC or NM)? Is one preferable to the other if you have reference sequences for both types of molecule?

NM_002454.2:c.66A>G
NM_024010.2:c.147A>G
NM_024091.3:c.-1995T>C

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 10.1 years ago by jcorroon ▴ 50

1

Ideally, if there are multiple sequences, you pick the major transcript. This transcript usually is either associated best with the known function of the gene, expressed more ubiquitously or includes all exons (most often these factors coincide).

GenBank should give you a clear picture on this major transcript.

Ideally, you should specify both c. and g. co-ordinates. The catch is, g. can change over time but c. usually is stable, in my experience. But with current technologies, I doubt even g. will change. To cover all bases, specify the actual reference sequence (hg19/GRCh37 or hg38/GRCh38) and the chromosomal co-ordinates. The good part is, once you specify these once, say, in a table, you can assign a temporary identifier across the manuscript and refer to it with that identifier.

I know I'm asking for too much, but I've been mining a ton of papers for information, and most of them did not future-proof their references - not that it is their fault, they did the best at their time.

ADD REPLY • link 2.6 years ago by Ram 45k

0

I know that this may seem like an outdated answer, but I would like to share this too, about which nomenclature is recommended in manuscripts.

I agree with all the answers. Some journals, such as the American Journal of Human Genetics, specify how they want authors to refer to mutations/variants. See the "Text specifications" part of https://www.cell.com/ajhg/authors. From the experience in my lab, my adviser tends to refer to polymorphisms with the genetic and protein level nomenclature and also the "rs" identifier. So, it looks something like Gene XXX rsXXXXX c.YYXX>XX p.ZZX>X. The journals our lab commonly publishes in (Pharmacogenomics journal, Frontiers in Pharmacogenetics and Pharmacogenomics) specify the nomenclature, and neither of them requires the RefSeq ID.

ADD REPLY • link 6.7 years ago by antonybcampos • 0