Ensembl transcript sequences start with "N"
7.4 years ago
Feng Xie • 0

Hello everyone, I'd like to explore the functional impact of some indels on their protein products. I annotated these mutations with Ensembl and got the corresponding transcripts. Then I translate these transcripts, with the indels applied, into peptides for downstream analysis.

But some transcripts from Ensembl have one or more "N" at the beginning of the sequence, and I don't know how to deal with them.

Below is an example transcript from Ensembl that starts with "N":

NTCAGGACAGTCGCTGAAATCACTGAATGCCTCCTCAGGTCATTTAGCACTTATTTTATCCAGTATCTTTGGGCTCCTTCTCCTGGTTCTGTTTATTCTATTTCTCACGTGGTGCCGAGTTCAGAAACAAAAACATCTGCCCCTCAGAGTTTCAACCAGAAGGAGGGGTTCTCTCGAGGAGAATTTATTCCATGAGATGGAGACCTGCCTCAAGAGAGAGGACCCACATGGGACAAGAACCTCAGATGACACCCCCAACCATGGTTGTGAAGATGCTAGCGACACATCGCTGTTGGGAGTTCTTCCTGCCTCTGAAGCCACAAAATGA

So, my questions are :

  1. Why do these transcripts contain "N"?
  2. How should we deal with "N" when translating the transcript into peptides?
  3. Or should we just drop the transcripts that contain "N"?

Thank you very much.

sequence gene SNP • 2.0k views

Someone from Ensembl may be along with an informed comment but perhaps the N was added to shift the frame (if that frame is producing sane results)? Just speculating.


Thank you. I found the corresponding peptide for the example transcript in Ensembl; the sequence is below:

XSGQSLKSLNASSGHLALILSSIFGLLLLVLFILFLTWCRVQKQKHLPLRVSTRRRGSLEENLFHEMETCLKREDPHGTRTSDDTPNHGCEDASDTSLLGVLPASEATK*

It seems that the "N" doesn't make any sense here, because translation of the transcript starts from the second nucleotide. So the "N" looks weird.


Sounds reasonable, assuming that either zero, one or two Ns are at the beginning - which OP could check?
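A quick way to check this over a set of transcripts is to count the leading Ns in each sequence. This is only a minimal sketch: the function name and the small dictionary of records are made up for illustration, and the first sequence is truncated from the example in the question.

```python
def count_leading_n(seq: str) -> int:
    """Return how many N/n characters the sequence starts with."""
    return len(seq) - len(seq.lstrip("Nn"))


# Hypothetical example records: transcript ID -> cDNA sequence.
transcripts = {
    "ENST00000539726": "NTCAGGACAGTCGCTGAAATCACTGAATGCC",  # truncated from the post above
    "example_no_n": "ATGGCCAAA",  # invented sequence without a leading N
}

# Map each transcript ID to the number of Ns at the start of its sequence.
leading_n = {tid: count_leading_n(seq) for tid, seq in transcripts.items()}
```

If every affected transcript shows zero, one or two leading Ns, that would be consistent with the frame-padding idea, though it would not prove it.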

7.4 years ago
Michael 54k

The sequence looks like a fragment of a transcript, with no start codon in the potential coding part. Indeed, when you put the sequence into getorf, the longest AA sequence is translated from position 2 forward. The X at the beginning of the Ensembl translation means 'unknown', maybe coming from the first N. Possibly the N was inserted to indicate the phase. What is the Ensembl ID of this transcript? Running the sequence through BLAST shows it is likely from a primate cysteine-rich scavenger receptor, e.g. XP_011518923.1. However, there are full-length RefSeq transcripts for these, so I don't understand why you are working with a fragment.

Here is the likely translation of your fragment from getorf:

>_5 [2 - 325] 
SGQSLKSLNASSGHLALILSSIFGLLLLVLFILFLTWCRVQKQKHLPLRVSTRRRGSLEE
NLFHEMETCLKREDPHGTRTSDDTPNHGCEDASDTSLLGVLPASEATK
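
If you want to reproduce this in a script rather than with getorf, here is a rough sketch in plain Python. It is not what getorf or Ensembl do internally. It assumes the reading frame begins right after the leading Ns (which is what the getorf result suggests for this fragment), uses the standard genetic code, and writes any codon that still contains an N as X. The function name `translate_after_leading_n` is made up for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_after_leading_n(seq: str) -> str:
    """Strip leading Ns, then translate in frame 1 of what remains.

    Assumption: the coding frame starts at the first non-N base.
    Codons not found in the table (e.g. containing N) become 'X';
    a trailing partial codon is ignored.
    """
    cds = seq.upper().lstrip("N")
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 31 nucleotides of the fragment posted in the question.
peptide_start = translate_after_leading_n("NTCAGGACAGTCGCTGAAATCACTGAATGCC")
# peptide_start == "SGQSLKSLNA", matching the start of the getorf output
```

For an indel analysis you would apply the variant to the sequence first and then translate, keeping in mind that the N-terminus of the protein is missing from this fragment.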

Thank you very much. Yes, this transcript looks like a fragment; its Ensembl ID is ENST00000539726 and the corresponding protein ID is ENSP00000438217. I'm wondering how these kinds of transcripts are generated?


Yes, this transcript is tagged "CDS 5' Incomplete". We annotate transcripts by aligning protein and cDNA sequences to the genome (EST alignments are displayed on our website but are usually not used as supporting evidence in the Ensembl annotation process). If the evidence is truncated/incomplete, we will not extrapolate to generate a complete transcript beyond the evidence supporting it.
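
If you decide to flag or drop such transcripts in bulk, one option is to look for the incomplete-CDS tag in the annotation file. The sketch below assumes a GTF whose attribute column marks these transcripts with `tag "cds_start_NF"`, as GENCODE-style annotation does; please check that your own file uses this tag before relying on it. The helper name `has_tag` and the attribute string are hypothetical and written only for illustration.

```python
import re


def has_tag(gtf_attributes: str, tag: str) -> bool:
    """Return True if the GTF attribute column contains: tag "<tag>";"""
    return tag in re.findall(r'tag "([^"]+)"', gtf_attributes)


# Hypothetical attribute column, not copied from a real GTF file.
attrs = (
    'gene_id "ENSG00000177675"; transcript_id "ENST00000539726"; '
    'tag "cds_start_NF"; tag "mRNA_start_NF";'
)

# Assumption: cds_start_NF marks a transcript whose CDS start was not found.
is_5p_incomplete = has_tag(attrs, "cds_start_NF")
```

Whether to drop or keep these transcripts depends on the analysis: the annotated part of the CDS is still supported by evidence, only the 5' end is unknown.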

Details about our annotation pipeline: http://www.ensembl.org/info/genome/genebuild/genome_annotation.html

Supporting (aligned) evidence for your example transcript: http://www.ensembl.org/Homo_sapiens/Transcript/SupportingEvidence?db=core;g=ENSG00000177675;r=12:7346685-7368970;t=ENST00000539726

Hope this helps.


It really helps. Thank you :)
