Question: Ensembl transcript sequences start with "N"
3.5 years ago, Feng Xie wrote:

Hello everyone, I'd like to explore the functional impact of some indels on their protein products. I annotated these mutations with Ensembl and retrieved the corresponding transcripts. I then translate these transcripts, with the indels applied, into peptides for downstream analysis.

However, some Ensembl transcripts have one or more "N" characters at the beginning of the sequence, and I don't know how to deal with them.

Below is an example transcript from Ensembl that starts with "N":


So, my questions are :

  1. Why do these transcripts start with "N"?
  2. How should we handle the "N" when translating a transcript into peptides?
  3. Or should we just drop the transcripts that contain "N"?

Thank you very much.


Someone from Ensembl may be along with an informed comment but perhaps the N was added to shift the frame (if that frame is producing sane results)? Just speculating.

written 3.5 years ago by genomax

Thank you. I found the corresponding peptide for the example transcript in Ensembl; the sequence is below:


It seems that the "N" doesn't serve any purpose here, because translation of the transcript starts from the second nucleotide. So the "N" looks odd.

written 3.5 years ago by Feng Xie

Sounds reasonable, assuming that either zero, one or two Ns are at the beginning - which OP could check?

written 3.5 years ago by WouterDeCoster
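
The check suggested above (are there only zero, one or two leading Ns?) could be sketched roughly as follows. This is only a minimal illustration: the FASTA records are invented toy sequences, not real Ensembl transcripts, and `read_fasta` and `leading_n_count` are hypothetical helper names.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def leading_n_count(seq):
    """Number of consecutive N characters at the 5' end of the sequence."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip("N"))


# Toy records, invented for illustration (not real Ensembl sequences).
toy_fasta = """>tx_a
NATGGCC
>tx_b
NNGCCTAA
>tx_c
ATGAAATAG
""".splitlines()

counts = {name: leading_n_count(seq) for name, seq in read_fasta(toy_fasta)}
histogram = Counter(counts.values())
print(counts)      # leading-N count per transcript
print(histogram)   # how many transcripts have 0, 1, 2, ... leading Ns
```

If every transcript in a real download shows at most two leading Ns, that would be consistent with the padding idea, although it would not prove it.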
3.5 years ago, Michael Dondrup (Bergen, Norway) wrote:

The sequence looks like a fragment of a transcript: there is no start codon in the potential coding part. Indeed, when you put the sequence into getorf, the longest amino acid sequence is translated from position 2 onward. The X at the beginning of the translation means 'unknown' and may come from the first N. Possibly the N was inserted to indicate the phase. What is the Ensembl ID of this transcript? Running the sequence through BLAST shows it is likely from a primate cysteine-rich scavenger receptor, e.g. XP_011518923.1. However, there are full-length RefSeq transcripts for these, so I don't understand why you are working with a fragment.
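
For readers without getorf at hand, the frame check can be approximated in plain Python. This is a rough sketch, not a reimplementation of getorf: it translates all three forward frames with the standard genetic code, maps any codon containing N (or another ambiguity code) to X, and reports the frame with the longest stop-free stretch. The toy sequence is invented, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, offset=0):
    """Translate seq from a 0-based offset; ambiguous codons become X."""
    seq = seq.upper()
    peptide = []
    for i in range(offset, len(seq) - 2, 3):
        peptide.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(peptide)


def longest_open_stretch(peptide):
    """Longest run of residues between stop codons (marked '*')."""
    return max(peptide.split("*"), key=len)


def best_forward_frame(seq):
    """Return (1-based start position, peptide) of the longest stop-free stretch."""
    candidates = []
    for offset in range(3):
        stretch = longest_open_stretch(translate(seq, offset))
        candidates.append((offset + 1, stretch))
    return max(candidates, key=lambda item: len(item[1]))


# Toy sequence with one leading N, invented for illustration.
toy_seq = "NGCTGAAAAACTGTGG"
for offset in range(3):
    print(offset + 1, translate(toy_seq, offset))
print(best_forward_frame(toy_seq))
```

For this toy input the frame starting at position 2 gives the longest open stretch, which mirrors the situation described above, but the same need not hold for every padded transcript.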

Here is the likely translation of your fragment from getorf:

>_5 [2 - 325] 

Thank you very much. Yes, this transcript looks like a fragment; its Ensembl ID is ENST00000539726 and the corresponding protein ID is ENSP00000438217. I'm wondering how these kinds of transcripts are generated?

written 3.5 years ago by Feng Xie

Yes, this transcript is tagged "CDS 5' Incomplete". We annotate transcripts by aligning protein and cDNA sequences to the genome (EST alignments are displayed on our website but are usually not used as supporting evidence in the Ensembl annotation process). If the evidence is truncated/incomplete, we will not extrapolate to generate a complete transcript beyond the evidence supporting it.

Details about our annotation pipeline:

Supporting (aligned) evidence for your example transcript:;g=ENSG00000177675;r=12:7346685-7368970;t=ENST00000539726

Hope this helps.

written 3.5 years ago by Ensembl Helen
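
If the goal is to find (or exclude) such transcripts in bulk, one option is to scan an annotation GTF for the corresponding tag. The sketch below assumes that "CDS 5' Incomplete" is exposed in the GTF attribute column as `tag "cds_start_NF"`, as in GENCODE-style GTF files; please verify this against the release you actually use. The GTF lines and IDs are toy placeholders, and the helper names are hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Collect GTF attributes into a dict of lists (keys such as 'tag' can repeat)."""
    attrs = {}
    for key, value in ATTRIBUTE_RE.findall(field):
        attrs.setdefault(key, []).append(value)
    return attrs


def incomplete_cds_transcripts(gtf_lines, tag="cds_start_NF"):
    """Return IDs of 'transcript' features carrying the given tag (assumed tag name)."""
    flagged = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        if tag in attrs.get("tag", []):
            flagged.update(attrs.get("transcript_id", []))
    return flagged


# Toy GTF lines with placeholder coordinates and IDs (not real annotation).
toy_gtf = [
    "#!toy-header",
    "\t".join(["1", "toy", "transcript", "100", "900", ".", "+", ".",
               'gene_id "TOY_G_1"; transcript_id "TOY_TX_1"; tag "basic";']),
    "\t".join(["1", "toy", "transcript", "300", "900", ".", "+", ".",
               'gene_id "TOY_G_1"; transcript_id "TOY_TX_2"; tag "cds_start_NF"; tag "mRNA_start_NF";']),
    "\t".join(["1", "toy", "exon", "300", "400", ".", "+", ".",
               'gene_id "TOY_G_1"; transcript_id "TOY_TX_2"; tag "cds_start_NF";']),
]

print(incomplete_cds_transcripts(toy_gtf))
```

Whether to drop the flagged transcripts or keep them with their padded translation depends on the downstream analysis; the sketch only identifies them.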

It really helps. Thank you :)

written 3.5 years ago by Feng Xie