6 months ago
Jason • 0

Hello all,

I am working on aligning protein orthologs from different species using the Ensembl API. Strangely, some protein sequences from non-human species contain a lot of X's. What does that mean? In theory, if the genome sequence is known, the protein sequence should be known too, right? And how should I score these X's when I calculate conservation scores? Thanks a lot. An example is shown below: ENSMEUP00000002410 from Notamacropus eugenii.

MGLSGAAGAAVLVLLAGHFSLGTALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQKNYDLSFLKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXILVGGVRFNNNPTLCNVETIQWKDIVGSAYVSNITIDNNSHPKSXXXXXXXXXXXXXXXXXXXXXXXXTKTICAQQCSGRCRGSSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVRKCPHNYVVTDHGSCVRSCNAETYEVEEDGVRKCKKCEGPCSKVCNGIGIGEFKDVLSINATNIKQFQNCTTISGDLHILPVAFKGDSFTNTPPLDPKELNILRTVKEISGFLLIQAWPENMTDLHAFEHLEIIRGRTKQHGQFSLAVVGVDITSLGLRSLKEISDGDVIISKNRQLCYANTINWSKLFGTRSQKTKITNNKDEKECRALGHVCHELCSSDGCWGPSSSHCLSCRYVSRQKKCVEKCNILEGEPREYMENLKCLQCHPECLPQLMNQTCTGPGPDKCVQCAHYIDGPHCVKTCPAGIMGEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXPKIPSIATGIVGGFLLLMVLVLGIGLFIRRRRIVRKRTLRRLLQEREXXXXXXLSPPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYIREHKDNIGSQYLLNWCVQIAKGMSYLEERRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSVLEKGERLPQPPICTIDVYMIMVKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSATSNTSATVCIDRNGQQTCPVKEESFIQRYSSDPTTVLLEDNVDDSFQPVP

The identifier ENSMEUP00000002410 seems to pull up tammar wallaby entries.

6 months ago

If I remember correctly, X is the protein equivalent of N in nucleotide sequences: an unknown amino acid (unknown as in "it couldn't be determined", not as in "new, never seen before").

This can happen if the genome assembly on which the gene/protein was annotated still has (quite a few) Ns in the genomic sequence. If an N appears in the 'wrong' position in a codon, you can't determine which amino acid it encodes, and so it is 'translated' as an X.
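To make the 'wrong position' point concrete, here is a minimal Python sketch of a translation routine that emits X whenever an N makes the codon ambiguous. It assumes the standard genetic code, and the function names (`translate_codon`, `translate`) are made up for this illustration; this is not Ensembl's actual pipeline, just the general idea.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one codon; return 'X' if an N leaves the residue undetermined."""
    # Expand every N to all four bases and collect the residues that result.
    options = [BASES if base == "N" else base for base in codon.upper()]
    residues = {CODON_TABLE.get("".join(c), "X") for c in product(*options)}
    # If every expansion gives the same residue, the N was harmless.
    return residues.pop() if len(residues) == 1 else "X"


def translate(cds):
    """Translate a coding sequence codon by codon (hypothetical helper)."""
    return "".join(translate_codon(cds[i:i + 3]) for i in range(0, len(cds) - 2, 3))


print(translate("ATGGCNANG"))  # GCN is always Ala; ANG could be K, T, R or M
```

An N in the third position of GCN does no harm because all four GCx codons encode alanine, whereas an N in the second position of ANG leaves four possible residues, so the sketch falls back to X.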

This is correct. X means any amino acid. Common substitution matrices such as BLOSUM62 apply a small penalty (mostly -1) when an amino acid is aligned with X, even when X aligns with another X.
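For the conservation-score part of the question, one way to carry the X penalty mentioned above into a per-column score is a simple sum-of-pairs average. The sketch below is only an illustration: the match/mismatch values are toy stand-ins for a real substitution matrix, and `pair_score` and `column_score` are hypothetical names.

```python
from itertools import combinations

X_PENALTY = -1          # score for any pair involving X, as described above
MATCH, MISMATCH = 1, 0  # toy values; a real analysis would use a substitution matrix


def pair_score(a, b):
    """Score one pair of residues, penalising any pair that contains X."""
    if a == "X" or b == "X":
        return X_PENALTY
    return MATCH if a == b else MISMATCH


def column_score(column):
    """Average pairwise score of one alignment column (needs >= 2 residues)."""
    pairs = list(combinations(column, 2))
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)


# One column taken from three aligned orthologs, the third with an unknown residue.
print(column_score("AAX"))
```

With a scheme like this, long stretches of X drag a column's score down, so it may be worth checking whether a low score reflects real divergence or just missing sequence.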

5 weeks ago

As lieven.sterck said, X is often used to denote an unknown amino acid, and Ensembl certainly seems to use this convention, as evidenced by long stretches of X's in some sequences. However, I've also noticed instances where X appears in the protein sequence even though directly translating the corresponding Ensembl coding sequence (CDS) would result in a stop codon at that position. This happens in multiple CDS/protein pairs (e.g. ENST00000673047.2 and ENST00000229022.9 in the human CDS/protein files, Ensembl release 104).

I think it is possible that Ensembl is using it to signify something else (in addition to an unknown amino acid), but I have yet to identify a pattern or find any info on this.

The case you describe can be due to a frameshift error (or other error) in the genomic sequence introducing a premature stop codon. If there is other evidence that the stop should not be there, Ensembl might decide to circumvent it by putting an X instead of the 'translated' stop codon, to indicate that the true/correct protein continues beyond this erroneous stop codon.

I contacted the help desk at Ensembl and they said these are cases where readthrough of the stop codon is known to occur. So it seems these are not "erroneous" stop codons, but rather indicators of a (relatively) unusual yet biologically real event in vivo. Here is their complete response: "The instances of stop codons being represented as X in the corresponding amino acid sequence occur where the manual curators have annotated the transcript as a 'stop-codon read through'. You can see this annotation in the 'Annotation Attributes' section at the bottom of the Transcript summary page: http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000111716;r=12:21635342-21657842;t=ENST00000673047"
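For anyone who wants to find such cases in their own CDS/protein pairs, the comparison described in this thread can be sketched roughly as follows. It assumes the standard genetic code and that the CDS and protein are in the same frame starting at the first codon; `stop_to_x_positions` is a hypothetical helper, and the sequences in the example are invented, not real Ensembl records.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def stop_to_x_positions(cds, protein):
    """Return 1-based protein positions where the codon is a stop but the protein has X."""
    hits = []
    for i, residue in enumerate(protein):
        codon = cds[3 * i:3 * i + 3].upper()
        # Only an in-frame stop codon paired with an X counts as a candidate.
        if residue == "X" and codon in STOP_CODONS:
            hits.append(i + 1)
    return hits


# Invented example: the third codon (TGA) is a stop, yet the protein continues with X.
print(stop_to_x_positions("ATGGCCTGAAAATAA", "MAXK"))
```

Positions reported this way are only candidates; the 'Annotation Attributes' on the transcript page would still need to be checked to confirm a readthrough annotation.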