I want to map UniProt proteins (main isoform) to Ensembl coding sequences in order to estimate Ka/Ks between very closely related species.
I went to UniProt and downloaded the protein sequences and their Ensembl transcript IDs (ENST). I converted the ENST IDs to ENSG IDs because, if I understood correctly, an ENSG represents a physical location in the genome (a gene) and the ENSTs are its transcript variants. Up to this point everything is OK. But then I tried to get the corresponding coding sequences. I went to Ensembl and downloaded the sequences by ENSG. For each ENSG, I looked for the ENST that codes for the corresponding protein. A large proportion of the ENSGs (~50%) have no transcript whose translation exactly matches the protein sequence.
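A minimal sketch of the exact-match check described above, assuming the standard genetic code and CDS sequences that start in frame at the first base. The function names and the transcript IDs in the usage note are hypothetical, and the sketch does not handle special cases such as selenocysteine or incomplete CDS annotations:

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G and the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS up to (not including) the first stop codon.

    Codons containing ambiguous bases (e.g. N) are translated as 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_matching_transcript(protein, transcripts):
    """Return the ID of the first transcript whose translation equals
    the protein exactly, or None if no transcript matches.

    transcripts: dict mapping a transcript ID to its CDS sequence.
    """
    protein = protein.upper()
    for transcript_id, cds in transcripts.items():
        if translate(cds) == protein:
            return transcript_id
    return None
```

For example, `find_matching_transcript("MA", {"T1": "ATGAAATAA", "T2": "ATGGCCTAA"})` returns `"T2"`, because only the second made-up CDS translates to `MA`. A check this strict reports any single-residue difference as a non-match, which is the behaviour described above.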
I had more success using exonerate on the CDS sequences (from Ensembl): only 6% of the protein/DNA sequence pairs show differences (mismatches, indels, insertions). This is clearly better, but:
Is using exonerate a good way to do this?
Why do so many UniProt proteins fail to match an Ensembl coding sequence?