How To Calculate The Protein Change And Codon Position Within A Nucleotide Sequence Of A Single Nucleotide Substitution?
3
4
Entering edit mode
13.1 years ago
Blunders ★ 1.1k

Appears that there is a misunderstanding on my part, and this question is an attempt to sort out these conflicts.

Definitions:

  • Protein change being a non-silent mutation where a single nucleotide substitution results in
  • Codon position being its count as the nth protein within an exon.

The source of these conflicts started with a question regarding how to calculate the protein change and codon position within a protein sequence of a single nucleotide substitution; in this case, the reference nucleotide sequence being from Ensembl. This question was flagged as a duplicate, but referenced this question and answer. My reading of the answer, which must be wrong, is that to get the protein change, first get the exon starting location, and then walk the nucleotide sequence by 3 until the location of the mutation is contained within the nucleotide sequence being walked.

What am I missing to achieve my goal of calculating the protein change and codon position within a protein sequence of a single nucleotide substitution from scratch? While I would like to do this from scratch, if possible, I would like to be able to script the calculations; ideally in Perl.

Additionally, if it’s not clear am aware of how to three-nucleotide sequences to their amino acid, if I know which three-nucleotide sequences to process.

DISCLAIMER: If it’s not clear, I have zero background in bioinformatics or microbiology. If using non-plain English terms, where possible please attempt to pair terms with a plain English contextualized explanation. Many thanks in advance.

mutation protein codon • 12k views
ADD COMMENT
7
Entering edit mode
13.1 years ago

Here is my genomic recipe: for a given transcript, you need:

  • the chromosome
  • the genetic code for this chromosome
  • the strand of this gene on the chromosome
  • the reference sequence of this chromosome
  • the start position of the transcription (txStart)
  • the end position of the transcription (endStart)
  • the start position of the translation (cdsStart)
  • the end position of the translation (cdsStart)
  • the number of exons
  • and for each exon the start and the end position of this exon.

and the pseudo-code using some half-open segments

if(gene is on '+' strand)
    {
    if(position < gene.cdsStart)
        {
        return it's an UTR5
        }
    else if( gene.cdsEnd <= position )
        {
        return  it's an UTR3'
        }

    for(exon: gene.exons)
        {
        for( i= exon.start;
            i< exon.end;
            ++i)
            {
            if(i==position)
                    {
                    info:" it's THE exon"
                    }
            if  i< gene.cdsStart continue;
            if i>=gene.cdsEnd break;

            if(i==position)
                {
                position_in_cdna=size(RNA);
                }

            wildRNA+= genomicSeq[i];

            if(i==position)
                {
                mutRNA[position_in_cdna]=alternate_base;
                }
            }
        }
    }
else // reverse orientation
    {

    if(position < gene.cdsStart)
        {
        return  it's an UTR3'
        }
    else if( gene.cdsEnd<=position )
        {
        return  it's an UTR5'
        }
    for-each(exon: reverse-order(gene.exons))
        {
        for(i= exon.end-1;
            i>= exon.start;
            --i)
            {
            if(i==position)
                    {
                    info("It's THE exon");
                    }
            if(i>= gene.cdsEnd) continue;
            if(i<  gene.cdsStart) break;

            if(i==position)
                {
                position_in_cdna=size(wildRNA);
                }

            wildRNA += complement(genomicSeq[i]);
            mutRNA +=complement(genomicSeq[i]);

            if(i==position)
                {
                amutRNA[position_in_cdna]=complement(alternate_base);
                }
            }
        }

    }//end of if reverse
return if its was in an intron 
mutProt=translate(mutRNA);
wildProt=translate(wildRNA);
pos_aa=position_in_cdna/3;

if(isStop(wildProt[pos_aa]) && !isStop(mutProt[pos_aa])))
    {
    return "EXON_STOP_LOST";
    }
else if( !isStop(wildProt[pos_aa]) && isStop(mutProt[pos_aa]))
    {
    return "EXON_STOP_GAINED";
    }
else if(wildProt[pos_aa]==mutProt[pos_aa])
    {
    return "EXON_CODING_SYNONYMOUS")
    }
else
    {
    return "EXON_CODING_NON_SYNONYMOUS"
    }
ADD COMMENT
0
Entering edit mode

+1 @Pierre Lindenbaum: The list of required input and the pseudo-code is a huge help - big thank you. The zero-based, half-open base coordinate system is interesting too! Plan to get to this in the next few days. Cheers!

ADD REPLY
3
Entering edit mode
13.1 years ago

I think the confusion lies in what Pierre wrote in the answer:

Using the reference sequence for 'chr11', you then 'just' have to walk over each exon to translate your protein until you've found your amino acid.

What he means here is not "walk over each exon separately", but rather "walk over all the exons in a gene, starting with the first exon". That should take care of the confusion with respect to point 2 in my previous answer. Pierre also referred to "the first translated base" in his answer, which implies that he skips the 5' UTR before he starts counting. That takes care of point 1 in my previous answer. So his and my answer are both correct, and they are also not inconsistent with each other.

However, it is very important to realize that:

  • You must start the "walk" at the first translated base, not the first base in exon 1.
  • You have to do it as one long walk through all the exons of a gene, you cannot treat each exon separately.
  • The definition of "codon position" is the number of the codon in the entire gene, and not its position within an individual exon.

I hope this helps :-)

ADD COMMENT
3
Entering edit mode
13.1 years ago
Michael 54k

The problem is possibly that you need to read up on the mechanism of translation, my feeling is that there is a confusion there. There are some terms you are mixing up. In fact, you can forget about the exon start for now, that has nothing to do with translation of the nucleotide sequence into an AA sequence.

Translation starts at a start codon and stops at the stop codon. The stop codon determins the reading frame of the sequence. The amino-acid sequence can be determined from the DNA sequence using the genetic code where codons are translated to amino-acids.

To determine the effect of a point mutation you will have to:

  • Combine the exons that form the gene into a single sequence (splice) (there are often alternative variants of how exons can be combined) and remove the introns.
  • Identify the relevant open reading frame, the region between start and stop codon that makes the gene AA sequence (there could be many)
  • Go to the codon with the mutation and translate both versions, then you will see the effect

You will see that splicing make the caculations of effect more difficult and probabilistic, as there are possibly many variants. There is however software that will attempt to predict the effect of variants and support you with this, like the Ensembl variant effect predictor.

ADD COMMENT

Login before adding your answer.

Traffic: 1983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6