Question: What are HGVS Names?
jcorroon wrote (2.4 years ago):

I am a clinician, and have no particular expertise in genomics. I'm confused by much of what I see on dbSNP. Any help would be greatly appreciated! Thank you in advance.

I need some help understanding HGVS Names!

I see SNPs referred to with conventions like "A66G" for MTRR (rs1801394) and "A1298C" for MTHFR (rs1801131).

When I look under HGVS names for rs1801394 I see: NM_002454.2:c.66A>G, which I assume equals "A66G", but there are many other "names" there. What are these names? Are they all equivalent?

When I look under HGVS names for rs1801131 I do not see anything resembling "A1298C". Does this mean I have the wrong rs# for this SNP? 

According to SNPedia, it's the correct rs#: "rs1801131 is a SNP in the MTHFR gene, representing an A>C mutation at mRNA position 1298".

snp • 4.2k views
ADD COMMENTlink modified 2.4 years ago by Jeremy Leipzig17k • written 2.4 years ago by jcorroon30

Hi, I have a similar question. I have the coding HGVS C.5769delG and would like to turn it into a SNP ID/FASTA format. How can I do so?

ADD REPLYlink written 7 days ago by marierose.mina0

It's c.5769delG, the first c should not be capitalized. It's a cDNA change, a G deletion at position 5769. What you have here is just a partial ID, by the way. You need to know the transcript this variant is referring to. See this comment for an example: C: What are HGVS Names?

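Once you have the transcript sequence, find its coding sequence (c. positions are counted from the A of the ATG start codon, not from the first base of the transcript) and use any tool to remove the 5769th base of the CDS (after you verify it's a G). You can use the python hgvs package to automate this.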
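To make the "verify, then delete" step concrete, here is a minimal sketch in plain Python. It does not use the hgvs package; the function name `delete_cds_base` and the toy sequence are made up for illustration, and the input is assumed to be a CDS that already starts at the A of the ATG.

```python
def delete_cds_base(cds: str, position: int, expected_base: str) -> str:
    """Return `cds` with the base at 1-based `position` removed.

    `cds` is assumed to start at the A of the ATG start codon, because
    c. positions are counted from there. Raises ValueError if the base
    found at `position` is not `expected_base`.
    """
    if not 1 <= position <= len(cds):
        raise ValueError(
            f"position {position} is outside the sequence (length {len(cds)})"
        )
    found = cds[position - 1].upper()
    if found != expected_base.upper():
        raise ValueError(f"expected {expected_base} at c.{position}, found {found}")
    # Keep everything before the position and everything after it.
    return cds[:position - 1] + cds[position:]


# Toy sequence, not a real transcript: delete the G at c.3 of "ATGGCC".
toy_cds = "ATGGCC"
print(delete_cds_base(toy_cds, 3, "G"))  # ATGCC
```

For c.5769delG the call would be `delete_cds_base(cds, 5769, "G")`, with `cds` taken from whichever transcript the variant was reported on.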

ADD REPLYlink modified 6 days ago • written 6 days ago by Ram12k
Ram (New York) wrote (2.4 years ago):

I'll try to explain briefly, but it's a lot and it's wonderful - one of the rare areas where we have great standards. Check it out: http://www.hgvs.org/mutnomen/

 

The usual format is:

<REFERENCE_SEQUENCE_ID>:<SEQUENCE_TYPE>.<POSITION><CHANGE>

"NM_002454.2:c.66A>G" is a cDNA change of Adenine to Guanine at position 66 in the cDNA ref sequence NM_002454 version 2.

A66G most commonly refers to an amino acid change of Alanine to Glycine at position 66 of some protein with some identifier (ideally written as "NP_xxxxxxx:p.A66G"). In your case though, I think you are using A66G to mean the nucleotide change 66A>G, which is not valid HGVS notation.

 

Ref sequences can be g. (genomic), c. (cDNA), r. (RNA) or p. (protein).

Changes can be single amino acid variants (p.Ala66Gly), single nucleotide variants (c.66A>G), deletions (c.10delA), insertions (c.10_11insT), duplications (c.10dupA), indels (c.10_11delACinsTGA), etc.

It is so well standardized that programs can parse and pick individual components for us to use.
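As a rough illustration of that last point, here is a minimal sketch that splits a simple nucleotide-level HGVS name into the four parts of the format above using a regular expression. The pattern and the function name `parse_simple_hgvs` are my own simplification, not the HGVS grammar: it only covers substitutions, deletions, duplications, insertions and delins on g./c./r. names, and it rejects protein names, intronic offsets and anything more exotic. A real parser such as the python hgvs package handles far more.

```python
import re

# Simplified pattern: <REFERENCE>:<TYPE>.<POSITION><CHANGE>
HGVS_SIMPLE = re.compile(
    r"^(?P<reference>[^:]+):"                    # e.g. NM_002454.2
    r"(?P<seq_type>[gcr])\."                     # g., c. or r.
    r"(?P<position>[-*]?\d+(?:_[-*]?\d+)?)"      # 66, -1995, 10_11
    r"(?P<change>"
    r"[A-Za-z]+>[A-Za-z]+"                       # substitution, A>G
    r"|del[A-Za-z]*ins[A-Za-z]+"                 # delins, delACinsTGA
    r"|del[A-Za-z]*"                             # deletion, delA
    r"|dup[A-Za-z]*"                             # duplication, dupA
    r"|ins[A-Za-z]+"                             # insertion, insT
    r")$"
)


def parse_simple_hgvs(name):
    """Return a dict of the name's components, or None if it does not match."""
    match = HGVS_SIMPLE.match(name.strip())
    return match.groupdict() if match else None


for name in ("NM_002454.2:c.66A>G", "NM_024091.3:c.-1995T>C", "c.5769delG"):
    print(name, "->", parse_simple_hgvs(name))
```

The last name comes back as None because it has no reference sequence ID in front of the colon, which is exactly the "partial ID" problem.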

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Ram12k

Thanks for your help!

To confirm, the format "A66G" means something different than the format "66A>G"? 

The A66G format represents an amino acid change, but the 66A>G format represents a base change?

What is the REFERENCE_SEQUENCE_ID?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by jcorroon30

NM_002454.2:c.66A>G is a standardized coding HGVS identifier and is nucleotide-specific, where the REFERENCE_SEQUENCE_ID is the transcript NM_002454.2.

"66A>G" alone could mean different things to different people, because it does not say which sequence position 66 is counted on.

 

ADD REPLYlink written 2.4 years ago by Jeremy Leipzig17k

There are 29 different HGVS names for rs1801131, many with different prefixes (e.g. NM, XP, NG, NC, etc.).

Are all the HGVS names equivalent (i.e. describing the same SNP)?

ADD REPLYlink written 2.4 years ago by jcorroon30

Each of those prefixes denotes a different type of reference sequence. For more information on Accession Number prefixes: http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

They do describe the same polymorphism, but on different reference sequences. The variant can sit at a different position on each mRNA (alternative splicing), and some of the mRNAs could be "predicted" (XM).
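As a quick reference, the most common RefSeq prefixes can be put into a small lookup table. This is a sketch covering only a subset of the prefixes in the NCBI table linked above; the function name `describe_accession` is made up for illustration.

```python
# Subset of RefSeq accession prefixes and the molecule type each one denotes.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule (e.g. a chromosome)",
    "NG": "genomic region (e.g. a gene)",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}


def describe_accession(accession):
    """Return the molecule type for a RefSeq-style accession such as NM_002454.2."""
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


print(describe_accession("NM_002454.2"))  # mRNA
```

Grouping the names listed for an rs number by this molecule type makes it easier to see that they are the same variant described on genomic, transcript and protein references.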

ADD REPLYlink written 2.4 years ago by Ram12k
Jeremy Leipzig (Philadelphia, PA) wrote (2.4 years ago):

The 1298 in A1298C must be relative to the mRNA, which includes the 5' UTR, or maybe it is just based on an earlier prediction of where the CDS began. I would imagine there is some historical reason it is still known by that name.

Here is one of many papers where they discuss c.665C>T and c.1286A>C:
http://www.nature.com/gim/journal/v15/n2/full/gim2012165a.html

HGVS coding nomenclature numbers positions from the CDS start (the A of the ATG start codon) and is transcript-specific, hence NM_005957.4:c.1286A>C is the HGVS identifier for this mutation. As long as you include the c-dot, people will know what you mean.
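The arithmetic behind an mRNA-based number versus a c. number can be sketched as below. This is only an illustration: the function name `mrna_to_c_position` is made up, 3' UTR positions (c.*N) are not handled, and the `cds_start` value of 13 in the example is a hypothetical chosen so that the 12-base difference between 1298 and 1286 works out. It is not taken from the NM_005957 record.

```python
def mrna_to_c_position(mrna_pos: int, cds_start: int) -> str:
    """Convert a 1-based mRNA position into a c. position label.

    `cds_start` is the 1-based mRNA position of the A of the ATG.
    Positions in the 5' UTR get negative numbers, and there is no c.0:
    the base just before the ATG is c.-1.
    """
    if mrna_pos < 1 or cds_start < 1:
        raise ValueError("positions are 1-based and must be positive")
    offset = mrna_pos - cds_start
    if offset >= 0:
        return f"c.{offset + 1}"   # the A of the ATG is c.1
    return f"c.{offset}"           # 5' UTR, e.g. c.-1, c.-2, ...


# Hypothetical CDS start at mRNA position 13 (illustrative value only).
print(mrna_to_c_position(1298, 13))  # c.1286
print(mrna_to_c_position(12, 13))    # c.-1
```

The same variant therefore gets a different number depending on where counting starts, which is why the reference sequence and the c. prefix both matter.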

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Jeremy Leipzig17k

Oh, the pain of the UTRs, and the sequence history! This numbering can be a nightmare - I spent a month bringing various variants up to speed on current references and tagging them to their source reference sequences.

ADD REPLYlink written 2.4 years ago by Ram12k

Thanks Jeremy. When identifying SNPs in this way in the future, should I always use the HGVS identifier with the "NM" prefix? Also, for the other example above (rs1801394) there are 3 HGVS names with "NM" prefixes. What would be the correct way of identifying this SNP?

Thanks SO much!

ADD REPLYlink written 2.4 years ago by jcorroon30

Ideally, you should. When you include the versioned accession (for example, NM_000123.2:), your name stays unambiguous across space (among all the reference sequences for that gene) and time (among all existing and future versions of that reference sequence). In other words, you will be really specific, and this will help people who refer to your work in the future.

If you are working on the same reference sequence (say, across the entire project/manuscript), you could always mention the ID on top/in the Methods section and just go with c.66A>G everywhere else. This way, you might minimize typing effort.
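The "state the reference once, then use the short form" approach can be checked mechanically. Below is a minimal sketch; the function names `split_hgvs` and `short_names` are made up, and pairing c.665C>T with NM_005957.4 in the example is my assumption for illustration rather than something stated in this thread.

```python
def split_hgvs(name):
    """Split 'REFERENCE:description' into its two parts."""
    reference, sep, description = name.partition(":")
    if not sep or not reference or not description:
        raise ValueError(f"not a full HGVS name: {name!r}")
    return reference, description


def short_names(full_names):
    """Return (shared_reference, short descriptions).

    Raises ValueError unless every name uses the same reference sequence,
    since the short form is only safe in that case.
    """
    pairs = [split_hgvs(name) for name in full_names]
    references = {reference for reference, _ in pairs}
    if len(references) != 1:
        raise ValueError(f"expected exactly one reference, got {sorted(references)}")
    return references.pop(), [description for _, description in pairs]


# Assumed example: both variants described on the same transcript.
reference, shorts = short_names(["NM_005957.4:c.665C>T", "NM_005957.4:c.1286A>C"])
print(reference, shorts)
```

If the names mix transcripts, the function raises instead of silently dropping the reference, which is the failure you want before the short forms go into a manuscript.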

ADD REPLYlink written 2.4 years ago by Ram12k

Thanks Ram. What is ideal if there are multiple reference sequence IDs for the same polymorphism? Should I include all of the reference sequence IDs in the Methods section, and then refer to all of the types, positions and changes elsewhere (i.e. c.66A>G, c.147A>G, c.-1995T>C)? When referring to a polymorphism, why is it ideal to use a reference sequence for a cDNA molecule (i.e. using the NM accession prefix) when you could use one for a genomic DNA molecule (i.e. using AC, NC or NG)? Is one preferable to the other if you have reference sequences for both types of molecule?

NM_002454.2:c.66A>G
NM_024010.2:c.147A>G
NM_024091.3:c.-1995T>C

ADD REPLYlink modified 2.4 years ago by Ram12k • written 2.4 years ago by jcorroon30

Ideally, if there are multiple sequences, you pick the major transcript. This transcript usually is either associated best with the known function of the gene, expressed more ubiquitously or includes all exons (most often these factors coincide).

GenBank should give you a clear picture on this major transcript.

Ideally, you should specify both c. and g. co-ordinates. The catch is that g. co-ordinates can change over time (for example, between genome builds), while c. co-ordinates are usually stable, in my experience. With current technologies, though, I doubt even g. will change. To cover all bases, specify the reference genome build (hg19/GRCh37 or hg38/GRCh38) along with the chromosomal co-ordinates. The good part is that once you have specified these, say, in a table, you can assign a temporary identifier and refer to the variant by that identifier throughout the manuscript.

I know I'm asking for too much, but I've been mining a ton of papers for information, and most of them did not future-proof their references - not that it is their fault, they did the best at their time.

ADD REPLYlink written 2.4 years ago by Ram12k