X amino acid in Ensembl
3.0 years ago
Jason • 0

Hello all,

I am working on aligning protein orthologs from different species, using the Ensembl API. Strangely, some protein sequences from non-human species contain a lot of X's. What does that mean? In theory, if the genome sequence is known, the protein sequence should be known too, right? And how do I score these X's when I calculate conservation scores? Thanks a lot. An example is shown below: ENSMEUP00000002410 from Notamacropus eugenii.

MGLSGAAGAAVLVLLAGHFSLGTALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQKNYDLSFLKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXILVGGVRFNNNPTLCNVETIQWKDIVGSAYVSNITIDNNSHPKSXXXXXXXXXXXXXXXXXXXXXXXXTKTICAQQCSGRCRGSSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVRKCPHNYVVTDHGSCVRSCNAETYEVEEDGVRKCKKCEGPCSKVCNGIGIGEFKDVLSINATNIKQFQNCTTISGDLHILPVAFKGDSFTNTPPLDPKELNILRTVKEISGFLLIQAWPENMTDLHAFEHLEIIRGRTKQHGQFSLAVVGVDITSLGLRSLKEISDGDVIISKNRQLCYANTINWSKLFGTRSQKTKITNNKDEKECRALGHVCHELCSSDGCWGPSSSHCLSCRYVSRQKKCVEKCNILEGEPREYMENLKCLQCHPECLPQLMNQTCTGPGPDKCVQCAHYIDGPHCVKTCPAGIMGEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXPKIPSIATGIVGGFLLLMVLVLGIGLFIRRRRIVRKRTLRRLLQEREXXXXXXLSPPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYIREHKDNIGSQYLLNWCVQIAKGMSYLEERRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSVLEKGERLPQPPICTIDVYMIMVKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSATSNTSATVCIDRNGQQTCPVKEESFIQRYSSDPTTVLLEDNVDDSFQPVP

aminoacid alignment ensembl protein sequence • 2.4k views

The identifier ENSMEUP00000002410 seems to pull up tammar wallaby entries.

3.0 years ago

If I remember correctly, X is the protein equivalent of N in nucleotide sequences: an unknown amino acid (unknown as in "it couldn't be determined", not as in "new, never seen before").

This can happen if the genome in which the gene/protein was annotated still has (quite a few) Ns in the genomic sequence. If an N falls in the 'wrong' position of a codon, you can't determine which amino acid it encodes, so it is 'translated' as an X.
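To make the codon argument concrete, here is a minimal sketch of how an N can end up as an X. It assumes the standard genetic code and a simple rule (emit X whenever the possible codons disagree); the function names are made up for illustration, and this is not how Ensembl's own pipeline is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one codon; return 'X' if an N makes the residue ambiguous."""
    codon = codon.upper()
    # Expand each N into all four bases, keep other characters as they are.
    options = [BASES if base == "N" else base for base in codon]
    residues = {
        CODON_TABLE.get("".join(candidate), "X")
        for candidate in product(*options)
    }
    # If every possible codon gives the same residue, the N does not matter.
    return residues.pop() if len(residues) == 1 else "X"


def translate(cds):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(translate_codon(cds[i:i + 3]) for i in range(0, usable, 3))


print(translate("ATGGCNNCGTAA"))
```

Here `GCN` still translates to A because all four `GC?` codons encode alanine, whereas `NCG` could be S, P, T or A and therefore becomes X. A long run of Ns in the assembly gives a long run of X's, as in the wallaby example.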


This is correct. X means any amino acid. Most substitution matrices apply the same small penalty (typically -1) when any amino acid is aligned with X, even when X aligns with another X.
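As a rough illustration of that flat penalty, the sketch below scores two already-aligned, gap-free sequences. The match and mismatch values are toy numbers chosen for the example, not taken from a real substitution matrix, and `score_alignment` is a hypothetical helper name.

```python
def score_alignment(seq_a, seq_b, match=4, mismatch=-2, x_penalty=-1):
    """Sum per-column scores for two aligned sequences of equal length.

    Any column containing X gets the flat x_penalty, including X vs X.
    The match/mismatch values are toy numbers, not a real matrix.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "X" or b == "X":
            total += x_penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("MKXA", "MKLA"))
print(score_alignment("XX", "XX"))
```

For a conservation score you may prefer to treat X as missing data and skip those columns instead; that would be a design choice on your side, not something the matrix dictates.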

2.6 years ago

As lieven.sterck said, X is often used to denote an unknown amino acid, and Ensembl certainly seems to follow this convention, as evidenced by the long stretches of X's in some sequences. However, I've also noticed instances where X appears in the protein sequence even though directly translating the corresponding Ensembl coding sequence (CDS) would give a stop codon at that position. This happens in multiple CDS/protein pairs (e.g. ENST00000673047.2 and ENST00000229022.9 in the human CDS/protein files, Ensembl release 104).

I think it is possible that Ensembl is using it to signify something else (in addition to an unknown amino acid), but I have yet to identify a pattern or find any info on this.
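The comparison described above can be sketched roughly as follows. It assumes you already have a CDS and its protein sequence as plain strings in the same frame (for example, read from the Ensembl CDS and peptide FASTA files), and `stops_shown_as_x` is a hypothetical helper name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stops_shown_as_x(cds, protein):
    """Return 0-based residue positions where the protein has X
    but the matching CDS codon is a stop codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # zip stops at the shorter of the two, so a terminal stop codon with
    # no matching residue in the protein is simply ignored.
    return [
        pos
        for pos, (codon, residue) in enumerate(zip(codons, protein.upper()))
        if residue == "X" and codon in STOP_CODONS
    ]


# Toy example: the second codon is TGA, but the protein shows X there.
print(stops_shown_as_x("ATGTGAGCTTAA", "MXA"))
```

Any position this reports is an X that cannot be explained by Ns in the codon, which is the pattern I am seeing.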


The case you describe can be due to a frameshift (or other) error in the genomic sequence that introduces a premature stop codon. If there is other evidence that the stop should not be there, Ensembl might work around it by putting an X in place of the 'translated' stop codon, to indicate that the true protein continues beyond this erroneous stop codon.


I contacted the help desk at Ensembl and they said these are cases where readthrough of the stop codon is known to occur. So it seems these are not "erroneous" stop codons, but rather indicators of a (relatively) unusual yet biologically real event in vivo. Here is their complete response: "The instances of stop codons being represented as X in the corresponding amino acid sequence occur where the manual curators have annotated the transcript as a 'stop-codon read through'. You can see this annotation in the 'Annotation Attributes' section at the bottom of the Transcript summary page: http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000111716;r=12:21635342-21657842;t=ENST00000673047"
