Amino Acid Change To Genomic Location
6
11
10.6 years ago
Preethi ▴ 110

This is the reverse of what is asked in http://biostar.stackexchange.com/questions/6297/genomic-change-to-aa-change

I have a list of gene names with associated amino acid changes, and I would like to find the genomic coordinates of the SNPs that cause them. What is the best way to do this?

snp amino-acids • 25k views
0

Here is an example of what I was talking about: BRAF.p.V600E:c.1799T>A

Given just this information, is there a way to get the genomic coordinates? The CDS position is included; can it be used to find the genomic location?
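
As a starting point, the notation in the example can be split into its parts, and the CDS position can be checked against the protein position. The following is a minimal Python sketch; the regular expression is an assumption based only on the single example above (gene, one-letter amino acid change, simple substitution), and the names `VARIANT_RE` and `parse_variant` are made up for illustration.

```python
import re

# Layout assumed from the one example in this thread: GENE.p.<AA><pos><AA>:c.<pos><nt>><nt>
VARIANT_RE = re.compile(
    r"^(?P<gene>[^.]+)\.p\.(?P<aa_ref>[A-Z])(?P<aa_pos>\d+)(?P<aa_alt>[A-Z*])"
    r":c\.(?P<cds_pos>\d+)(?P<nt_ref>[ACGT])>(?P<nt_alt>[ACGT])$"
)


def parse_variant(text):
    """Split a string such as 'BRAF.p.V600E:c.1799T>A' into its fields."""
    match = VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised variant string: {text!r}")
    fields = match.groupdict()
    aa_pos = int(fields["aa_pos"])
    cds_pos = int(fields["cds_pos"])
    # CDS positions are 1-based: bases 1-3 form codon 1, bases 4-6 codon 2, ...
    codon_number = (cds_pos - 1) // 3 + 1
    codon_offset = (cds_pos - 1) % 3  # 0, 1 or 2 within the codon
    fields.update(
        aa_pos=aa_pos,
        cds_pos=cds_pos,
        codon_number=codon_number,
        codon_offset=codon_offset,
        consistent=(codon_number == aa_pos),
    )
    return fields


if __name__ == "__main__":
    variant = parse_variant("BRAF.p.V600E:c.1799T>A")
    print(variant["gene"], variant["codon_number"], variant["codon_offset"])
```

For the example, c.1799 falls in codon 600 at the middle base, which agrees with p.V600E. Going from the CDS position to a genomic coordinate still requires the exon structure of a specific transcript.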

9
10.6 years ago

I wrote a tool named backlocate for this job.

This tool is available in my experimental package http://code.google.com/p/variationtoolkit/.

example:

echo -e  "NOTCH2\tM1T\nEIF4G1\tD240Y" |\
backlocate -f /path/to/hg19.fa

#User.Gene    AA1    petide.pos.1    AA2    knownGene.name    knownGene.strand    knownGene.AA    index0.in.rna    codon    base.in.rna    chromosome    index0.in.genomic    exon
##uc001eik.2
NOTCH2    M    1    T    uc001eik.2    -    M    0    ATG    A    chr1    120612019    Exon 1
NOTCH2    M    1    T    uc001eik.2    -    M    1    ATG    T    chr1    120612018    Exon 1
NOTCH2    M    1    T    uc001eik.2    -    M    2    ATG    G    chr1    120612017    Exon 1
##uc001eil.2
NOTCH2    M    1    T    uc001eil.2    -    M    0    ATG    A    chr1    120612019    Exon 1
NOTCH2    M    1    T    uc001eil.2    -    M    1    ATG    T    chr1    120612018    Exon 1
NOTCH2    M    1    T    uc001eil.2    -    M    2    ATG    G    chr1    120612017    Exon 1
##uc001eim.3
NOTCH2    M    1    T    uc001eim.3    -    M    0    ATG    A    chr1    120548116    Exon 2
NOTCH2    M    1    T    uc001eim.3    -    M    1    ATG    T    chr1    120548115    Exon 2
NOTCH2    M    1    T    uc001eim.3    -    M    2    ATG    G    chr1    120548114    Exon 2
##Warning ref aminod acid for uc003fnp.2  [240] is not the same (I/D)
EIF4G1    D    240    Y    uc003fnp.2    +    I    717    ATC    A    chr3    184039089    Exon 10
EIF4G1    D    240    Y    uc003fnp.2    +    I    718    ATC    T    chr3    184039090    Exon 10
EIF4G1    D    240    Y    uc003fnp.2    +    I    719    ATC    C    chr3    184039091    Exon 10
##Warning ref aminod acid for uc003fnu.3  [240] is not the same (I/D)
EIF4G1    D    240    Y    uc003fnu.3    +    I    717    ATC    A    chr3    184039089    Exon 9
EIF4G1    D    240    Y    uc003fnu.3    +    I    718    ATC    T    chr3    184039090    Exon 9
EIF4G1    D    240    Y    uc003fnu.3    +    I    719    ATC    C    chr3    184039091    Exon 9
##Warning ref aminod acid for uc003fnq.2  [240] is not the same (V/D)
EIF4G1    D    240    Y    uc003fnq.2    +    V    717    GTA    G    chr3    184039350    Exon 7
EIF4G1    D    240    Y    uc003fnq.2    +    V    718    GTA    T    chr3    184039351    Exon 7
EIF4G1    D    240    Y    uc003fnq.2    +    V    719    GTA    A    chr3    184039352    Exon 7
##Warning ref aminod acid for uc003fnr.2  [240] is not the same (L/D)
EIF4G1    D    240    Y    uc003fnr.2    +    L    717    CTC    C    chr3    184039581    Exon 6
EIF4G1    D    240    Y    uc003fnr.2    +    L    718    CTC    T    chr3    184039582    Exon 6
EIF4G1    D    240    Y    uc003fnr.2    +    L    719    CTC    C    chr3    184039583    Exon 6
##Warning ref aminod acid for uc003fny.3  [240] is not the same (T/D)
EIF4G1    D    240    Y    uc003fny.3    +    T    717    ACC    A    chr3    184039677    Exon 3
EIF4G1    D    240    Y    uc003fny.3    +    T    718    ACC    C    chr3    184039678    Exon 3
EIF4G1    D    240    Y    uc003fny.3    +    T    719    ACC    C    chr3    184039679    Exon 3
##uc010hxx.2
EIF4G1    D    240    Y    uc010hxx.2    +    D    717    GAT    G    chr3    184038780    Exon 10
EIF4G1    D    240    Y    uc010hxx.2    +    D    718    GAT    A    chr3    184039069    Exon 11
EIF4G1    D    240    Y    uc010hxx.2    +    D    719    GAT    T    chr3    184039070    Exon 11
##Warning ref aminod acid for uc003fns.2  [240] is not the same (L/D)
EIF4G1    D    240    Y    uc003fns.2    +    L    717    CTC    C    chr3    184039209    Exon 10
EIF4G1    D    240    Y    uc003fns.2    +    L    718    CTC    T    chr3    184039210    Exon 10
EIF4G1    D    240    Y    uc003fns.2    +    L    719    CTC    C    chr3    184039211    Exon 10
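
The idea behind this kind of output can be illustrated with a small Python sketch that walks the coding exons of one transcript and converts a 0-based index in the coding sequence into a genomic position. This is not backlocate's actual code: the function name `cds_index_to_genomic`, the 0-based half-open intervals, and the toy coordinates are all assumptions made for illustration.

```python
def cds_index_to_genomic(exons, cds_start, cds_end, strand, index0):
    """Map a 0-based coding-sequence index to a 0-based genomic position.

    exons: list of (start, end) genomic intervals, 0-based half-open,
           sorted by ascending genomic start (knownGene-like layout).
    cds_start, cds_end: genomic half-open interval of the coding region.
    strand: "+" or "-".
    """
    # Keep only the coding part of each exon.
    blocks = []
    for start, end in exons:
        clipped_start, clipped_end = max(start, cds_start), min(end, cds_end)
        if clipped_start < clipped_end:
            blocks.append((clipped_start, clipped_end))
    # On the minus strand the transcript starts at the highest coordinate.
    if strand == "-":
        blocks.reverse()
    remaining = index0
    for start, end in blocks:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise IndexError("index lies beyond the coding sequence")


if __name__ == "__main__":
    toy_exons = [(100, 110), (200, 220)]  # hypothetical transcript
    # Codon 2 (CDS indexes 3, 4, 5) straddles the exon junction.
    print([cds_index_to_genomic(toy_exons, 105, 215, "+", i) for i in (3, 4, 5)])
```

A codon that straddles an exon junction maps to non-adjacent genomic positions, which is what the uc010hxx.2 rows above show (one base in Exon 10, two in Exon 11).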

0

@Pierre: I'm wondering if your tool is applicable to other species? I mean, if I have a GFF3 file plus the genome sequence in FASTA format, will I be able to use your tool to back-locate SNPs identified in RNA-seq data to their genomic locations?

0

"if your tool is applicable to other species" : yes, as long as you have a fasta sequence and a knownGene-like table in mysql.

0
@Pierre: I'd love to try your tool, but I'd like to know which version you recommend. I have been having a hard time with the version from your GitHub: everything compiled, but I get a "java.lang.ClassNotFoundException: com.mysql.jdbc.Driver" error, and adding the MySQL Connector driver to my $CLASSPATH did not fix it. The version in your 'variationtoolkit' needs some dependencies (tabix) that I can't seem to install properly for compilation.
0

Hi Pierre,

Is this tool still available at the link you pointed to? I could not download anything by following the "How to Install the Variation toolkit" instructions (https://code.google.com/p/variationtoolkit/wiki/HowToInstall). I need to use your backlocate tool. Thanks!

0

To anyone still trying to use this: the best way I found to get it to compile was to follow the instructions at https://github.com/lindenb/jvarkit/wiki/BackLocate (use the standalone=yes instruction). It worked on macOS Sierra and on CentOS.

6
6.6 years ago

http://bioinformatics.mdanderson.org/main/Transvar

Introduction

TransVar is a reverse annotator for inferring genomic characterization(s) of mutations (e.g., chr3:178936091 G/A) from protein or cDNA annotation(s) (e.g., PIK3CA p.E545K or PIK3CA c.1633G>A). It is designed for resolving ambiguous mutation origins, arising from alternative splicing.

TransVar has the following features:

• supports HGVS nomenclature
• supports both left-alignment and right-alignment convention in reporting indels.
• supports annotation of a region based on a transcript dependent characterization
• supports single nucleotide variation (SNV), insertions and deletions (indels) and block substitutions
• supports mutations at both coding region and intronic/UTR regions
• supports transcript annotation from commonly-used databases such as Ensembl, NCBI RefSeq and GENCODE etc
• supports UniProt protein id as transcript id
• supports GRCh36, 37, 38
• functionality of forward annotation.

Citation: Zhou W, Chen T, Chong Z, Rohrdanz MA, Melott JM, Wakefield C, Zeng J, Weinstein JN, Meric-Bernstam F, Mills GB, Chen K. TransVar: a multi-level variant annotator for precision genomics. Nature Methods. In Press.

0

This saved me a lot of time. Thanks!

0

I just want to point out that the project has moved to GitHub.

5
9.5 years ago
Emily 23k

Have you tried the Ensembl REST API?

To convert from protein coordinates http://beta.rest.ensembl.org/documentation/info/assembly_translation

To convert from transcript coordinates

You'd need to get the Ensembl protein or transcript IDs, which you can get easily using BioMart (tutorial on BioMart here).

Then, for your query, BRAF.p.V600E:c.1799T>A, input

From protein

http://beta.rest.ensembl.org/map/translation/ENSP00000288602/600..600?content-type=application/json

Output

{"mappings":[{"seq_region_name":"7","gap":0,"coord_system":"chromosome","strand":-1,"rank":0,"end":140453137,"start":140453135}]}


From transcript

http://beta.rest.ensembl.org/map/cds/ENST00000288602/1799?content-type=application/json

{"mappings":[{"seq_region_name":"7","gap":0,"coord_system":"chromosome","strand":-1,"rank":0,"end":140453136,"start":140453136}]}
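
The same two lookups can be scripted. The sketch below uses only the Python standard library; the server address `https://rest.ensembl.org` is an assumption (the beta host in the URLs above is historical, and the server you pick may report coordinates on a different assembly than the ones quoted here), and the helper names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumed host; replace with the Ensembl REST server and assembly you need.
DEFAULT_SERVER = "https://rest.ensembl.org"


def build_map_url(kind, stable_id, start, end=None, server=DEFAULT_SERVER):
    """Build a /map/translation or /map/cds URL in the style used above."""
    if kind not in ("translation", "cds"):
        raise ValueError("kind must be 'translation' or 'cds'")
    end = start if end is None else end
    return (f"{server}/map/{kind}/{stable_id}/{start}..{end}"
            "?content-type=application/json")


def parse_mappings(payload):
    """Turn the JSON response into (chromosome, start, end, strand) tuples."""
    data = json.loads(payload)
    return [
        (m["seq_region_name"], min(m["start"], m["end"]),
         max(m["start"], m["end"]), m["strand"])
        for m in data["mappings"]
    ]


def fetch_mappings(kind, stable_id, start, end=None, server=DEFAULT_SERVER):
    """Query the server (needs network access) and parse the reply."""
    url = build_map_url(kind, stable_id, start, end, server)
    with urlopen(url, timeout=30) as response:
        return parse_mappings(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Parsing the transcript-level response quoted above, without any network call.
    sample = ('{"mappings":[{"seq_region_name":"7","gap":0,'
              '"coord_system":"chromosome","strand":-1,"rank":0,'
              '"end":140453136,"start":140453136}]}')
    print(parse_mappings(sample))
```

Calling `fetch_mappings("translation", "ENSP00000288602", 600)` would return the codon interval, and `fetch_mappings("cds", "ENST00000288602", 1799)` the single base, provided the server is reachable.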

0

I like the idea of being able to use Ensembl, but is there any way to get the nucleotide as well? My example variant is: VAR_031436 Q9NXK6 MPRG_HUMAN p.Ile24Thr
I need both the nucleotide change and the chromosome coordinate for it. The following link gives me the chromosomal location but not the corresponding nucleotide change (I am assuming there is only one possible combination of reference and alternate nucleotides, so the correspondence is one-to-one with no ambiguity): http://beta.rest.ensembl.org/map/translation/ENSP00000343877/24..24?content-type=application/json

Any thoughts?

1

You can get this using the Ensembl VEP. http://www.ensembl.org/info/docs/tools/vep/index.html.

Use the annotation you have there, selecting HGVS annotation as your input file format. This will tell you the location you hit, the nucleotide/codon change, the gene/transcript/protein sequence it hits.
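
If the changes are stored in short form (for example V600E), they first have to be written as protein-level HGVS strings like the ones used in this thread. A minimal Python sketch follows; it assumes simple one-letter substitutions and the three-letter style seen here (such as p.Ile24Thr), and the function name `to_hgvs_protein` is made up for illustration.

```python
import re

# Standard one-letter to three-letter amino acid codes ("*" is a stop codon).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def to_hgvs_protein(protein_id, change):
    """Format e.g. ('ENSP00000288602', 'V600E') as 'ENSP00000288602:p.Val600Glu'."""
    match = re.fullmatch(r"([A-Z*])(\d+)([A-Z*])", change.strip())
    if match is None:
        raise ValueError(f"unsupported change: {change!r}")
    ref, position, alt = match.groups()
    if ref not in THREE_LETTER or alt not in THREE_LETTER:
        raise ValueError(f"unknown amino acid code in {change!r}")
    return f"{protein_id}:p.{THREE_LETTER[ref]}{position}{THREE_LETTER[alt]}"


if __name__ == "__main__":
    print(to_hgvs_protein("ENSP00000288602", "V600E"))
```

One string per line can then be supplied as the input file. As the rest of this thread shows, the protein ID has to carry the stated reference amino acid at that position, otherwise no result comes back.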

0

Emily, thanks for pointing that out! I was able to format my data as HGVS and use VEP to obtain the coordinate and codon. However, some variants do not return any output from VEP. For example, ENSP00000256339:p.Val1597Ala returns a blank in VEP, though it works just fine in PolyPhen-2: http://genetics.bwh.harvard.edu/ggi/pph2/3a7d3e9d7dc0c940e1b608c62ba2c28296dcfe2b/1771738.html

0

Hi. I think the issue here is that we don't have a Valine annotated at position 1597 in ENSP00000256339. We have that position as a Proline (http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?db=core;g=ENSG00000133958;r=14:93799565-94173618;t=ENST00000256339). I put in the input ENSP00000256339:p.Pro1597Ala instead and got four hits. Perhaps this is the wrong protein ID?

0

The UniProt ID Q9P2D8 supposedly maps to this ENSP, which seems incorrect! In fact, it should have been ENSP00000376858, which works fine with VEP as well. I am checking with the UniProt team on that.

Thanks once again!

0

"To convert from transcript coordinates you'd need to get the Ensembl protein or transcript IDs, which you can get easily using BioMart (tutorial on BioMart here"

@Emily_Ensembl I think you forgot to include the link to the BioMart tutorial on how to go from genomic coordinate to transcript ID. Can you post it here? Thanks.

0
10.6 years ago

You can find the 3 bp codon that encodes the amino acid, but you need more information if you want single-base-pair resolution...
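
Part of that extra information can be inferred when the reference codon is known: listing the single-base substitutions that turn it into the target amino acid often leaves only one candidate. The Python sketch below assumes the standard genetic code, a single-nucleotide change, and a codon written as it reads on the transcript; the names are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_changes(ref_codon, target_aa):
    """List (offset, ref_base, alt_base) substitutions that yield target_aa."""
    ref_codon = ref_codon.upper()
    hits = []
    for offset, ref_base in enumerate(ref_codon):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
            if CODON_TABLE[alt_codon] == target_aa:
                hits.append((offset, ref_base, alt_base))
    return hits


if __name__ == "__main__":
    print(single_base_changes("GTG", "E"))  # valine codon to glutamate
    print(single_base_changes("AGT", "R"))  # serine codon to arginine
```

For a GTG (valine) codon changing to glutamate there is a single solution (T>A at the middle base), whereas AGT (serine) to arginine has three, so the amino acid change alone does not always pin down the nucleotide. For genes on the minus strand, the bases must also be complemented to express the change on the genomic forward strand.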

0
9.7 years ago
nonish5 ▴ 40

Can I use it if I don't know the amino acid position in the protein? In addition, as far as I understand, a gene name plus mutation info (of the form c.123G>T or IVS4+1G>T) is not enough to deduce a specific genomic location, as there may be more than one transcript. Am I wrong?

0

"if I don't know the amino acid position of the protein?": you could loop over all the positions of the protein. "are not enough in order to deduce a specific genomic location as there may be more than a single transcript": of course. Furthermore, the very same protein can be encoded by two different mRNAs.

0
9.5 years ago
Christian ★ 3.0k

Although it was built for mapping protein sequence intervals, my script "protein2genome.pl" can do this. It ships with the variant annotation tool CooVar.

Here is how it works. First you need a GFF or a GTF file with the coordinates of your genes. To test your specific example, I created a GTF file that contains only the first isoform of the BRAF gene, but you could also work with GFF/GTF files containing all human genes and isoforms or containing genes from any other organism. Then you can run protein2genome.pl like this:

echo "ENST00000288602 . . 600 600 . . . ID=V600E" | perl protein2genome.pl BRAF-001.gtf


Produces the output:

7    .    .    140453135    140453137    .    -    .    ID=V600E(ENST00000288602);segment=1of1;p_start=600;p_end=600


Explanation: I am piping a GFF-compliant input line, which specifies the transcript ID and the position of the protein sequence change, into the script. The script then outputs the mapped genomic coordinates (chromosome 7, codon start=140453135, codon end=140453137).
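
For a longer list of changes, the input lines can be generated and the output parsed with a few lines of Python. This is only a sketch: it assumes the nine columns are tab-separated, as in standard GFF (the example above is displayed with spaces), and both helper names are made up for illustration.

```python
def protein2genome_line(transcript_id, aa_start, aa_end, label):
    """Build one nine-column GFF-style input line for a protein interval."""
    columns = [transcript_id, ".", ".", str(aa_start), str(aa_end),
               ".", ".", ".", f"ID={label}"]
    return "\t".join(columns)  # tab separator is an assumption


def parse_output_line(line):
    """Extract (chromosome, start, end, strand) from one output line."""
    fields = line.split(None, 8)  # split on any whitespace, keep attributes whole
    if len(fields) < 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return fields[0], int(fields[3]), int(fields[4]), fields[6]


if __name__ == "__main__":
    print(protein2genome_line("ENST00000288602", 600, 600, "V600E"))
    output = ("7    .    .    140453135    140453137    .    -    .    "
              "ID=V600E(ENST00000288602);segment=1of1;p_start=600;p_end=600")
    print(parse_output_line(output))
```

The generated lines can be written to a file and piped into the script in the same way as the echo command above.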

A more detailed explanation of input and output formats can be found here.

I would say this script is more useful for non-model organisms, because for model organisms with associated databases you have other possibilities to do that (see other answers in this thread, for example the Ensembl REST API which is quite neat).