Question: Ensembl transcript sequences start with "N"
2.5 years ago by
Hello everyone, I'd like to explore the functional impact of some InDels on their protein products. I annotated these mutations with Ensembl and got the corresponding transcripts. Then I translate these transcripts, with the InDels applied, into peptides for downstream analysis.

But some transcripts from Ensembl have one or more "N" at the beginning of the sequence, and I don't know how to deal with these transcripts.

Below is an example transcript from Ensembl that starts with "N":


So, my questions are :

  1. Why do these transcripts start with "N"?
  2. How should we deal with the "N" when we translate the transcript into peptides?
  3. Or should we just drop the transcripts with "N"?

Thank you very much.

snp sequence gene • 847 views
Someone from Ensembl may be along with an informed comment but perhaps the N was added to shift the frame (if that frame is producing sane results)? Just speculating.

Thank you. I found the corresponding peptide of the example transcript in Ensembl; below is the sequence:


It seems that the "N" doesn't make any sense here, because translation of the transcript starts from the second nucleotide. So the "N" here looks weird.

Sounds reasonable, assuming that either zero, one or two Ns are at the beginning - which OP could check?
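One quick way to check this is to tally how many Ns each transcript starts with. The sketch below assumes the transcripts are already loaded into a dict of ID to sequence; the IDs and sequences shown are made-up placeholders, not real Ensembl records.

```python
from collections import Counter


def leading_n_count(seq):
    """Return the number of consecutive N characters at the start of seq."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip("N"))


def tally_leading_n(transcripts):
    """Count how many transcripts start with 0, 1, 2, ... leading Ns."""
    return Counter(leading_n_count(s) for s in transcripts.values())


# Hypothetical example data (placeholder IDs and sequences).
transcripts = {
    "tx_a": "ATGGCCTGA",
    "tx_b": "NATGGCCTGA",
    "tx_c": "NNGCCTGA",
}

for n_count, n_transcripts in sorted(tally_leading_n(transcripts).items()):
    print(f"{n_count} leading N: {n_transcripts} transcript(s)")
```

If every affected transcript has only one or two leading Ns, that would be consistent with the padding idea; longer runs would suggest something else is going on.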

2.5 years ago by
Bergen, Norway
The sequence looks like a fragment of a transcript: there is no start codon in the potential coding part. Indeed, when you put the sequence into getorf, the longest AA sequence is translated from position 2 forward. The X at the beginning of the translation means 'unknown', and may come from the first N. Possibly the N was inserted to indicate the phase. What is the Ensembl ID of this transcript? Running the sequence through BLAST shows it is likely from a primate cysteine-rich scavenger receptor, e.g. XP_011518923.1. However, there are full-length RefSeq transcripts for these, so I don't understand why you are working with a fragment.
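To see how a leading N turns into an X, here is a minimal sketch of a frame-aware translation in plain Python. It uses the standard genetic code and reports any codon containing an ambiguous base as X. The `translate` helper and the toy sequence are illustrative assumptions; this is not how getorf or Ensembl do it internally.

```python
from itertools import product

# Standard genetic code, with codons ordered TCAG at each position
# (the same ordering NCBI uses for translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate seq starting at 0-based offset frame.

    Codons that contain N (or any other non-ACGT character) become 'X'.
    A trailing partial codon is ignored.
    """
    seq = seq.upper()
    peptide = []
    for i in range(frame, len(seq) - 2, 3):
        peptide.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(peptide)


# Toy sequence (not the real transcript): one leading N, then ATG GCC TGA.
toy = "NATGGCCTGA"
print(translate(toy, frame=0))  # XGL  (first codon NAT is ambiguous)
print(translate(toy, frame=1))  # MA*  (starts at position 2, skipping the N)
```

Translating from offset 0 keeps the N and yields a leading X, while translating from offset 1 (position 2 in 1-based terms) skips it.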

Here is the likely translation of your fragment from getorf:

>_5 [2 - 325] 
Thank you very much. Yes, this transcript looks like a fragment; its Ensembl ID is ENST00000539726 and the corresponding protein ID is ENSP00000438217. I'm wondering how these kinds of transcripts are generated?

Yes, this transcript is tagged "CDS 5' Incomplete". We annotate transcripts by aligning protein and cDNA sequences to the genome (EST alignments are displayed on our website but are usually not used as supporting evidence in the Ensembl annotation process). If the evidence is truncated/incomplete, we will not extrapolate to generate a complete transcript beyond the evidence supporting it.

Details about our annotation pipeline:

Supporting (aligned) evidence for your example transcript:;g=ENSG00000177675;r=12:7346685-7368970;t=ENST00000539726

Hope this helps.

It really helps. Thank you :)

