I have a set of protein IDs. In reality there are thousands, but for this example, there are these three NCBI IDs:
XP_016775379.1 XP_008018068.1 XP_007991648.1
I want to automatically extract (since in reality there are thousands of sequences) the coding sequence (CDS) that encodes each of these proteins, i.e. the CDS should be three times as long as the protein sequence, plus three bases if the stop codon is included.
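To make the request concrete, here is a rough sketch of the kind of automated lookup I have in mind, using only the Python standard library against the NCBI E-utilities `efetch` endpoint. It rests on an assumption I have not confirmed: that `efetch` accepts `rettype=fasta_cds_na` for `db=protein` and returns the CDS nucleotide FASTA for each protein ID. The helper names (`build_cds_url`, `parse_fasta`, `fetch_cds`) are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_cds_url(protein_ids):
    """Build an efetch URL asking for the CDS nucleotide FASTA of each protein."""
    params = {
        "db": "protein",
        "id": ",".join(protein_ids),
        "rettype": "fasta_cds_na",  # assumption: accepted for db=protein
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_cds(protein_ids):
    """Download and parse the CDS records (needs network access)."""
    with urlopen(build_cds_url(protein_ids)) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Example call (commented out because it contacts NCBI):
# for header, seq in fetch_cds(["XP_016775379.1", "XP_008018068.1", "XP_007991648.1"]):
#     print(header, len(seq))
```

For thousands of IDs, the list would presumably need to be sent in batches with a pause between requests, since NCBI limits unauthenticated clients to about three requests per second.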
My problem is that, for example, the last protein, XP_007991648.1, is encoded by the mRNA XM_007993457.1. The full mRNA sequence is here. However, the actual coding sequence that encodes the protein I want (a subset of the full mRNA sequence) is here. So I do not want to extract the full mRNA sequence, only the section of the mRNA that encodes this particular protein.
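To illustrate the difference between the full mRNA and the part I want, here is a toy sketch that slices a CDS out of an mRNA given 1-based inclusive coordinates (the convention used in a GenBank `CDS` feature location) and checks its length against the protein. The sequences and coordinates are made up and do not come from XM_007993457.1; the function names are hypothetical.

```python
def extract_cds(mrna, start, end):
    """Return the CDS from an mRNA, given 1-based inclusive coordinates."""
    if not (1 <= start <= end <= len(mrna)):
        raise ValueError("CDS coordinates fall outside the mRNA")
    return mrna[start - 1:end]


def matches_protein_length(cds, protein):
    """True if the CDS is 3 x protein length, with or without a stop codon."""
    return len(cds) in (3 * len(protein), 3 * len(protein) + 3)


# Made-up toy data, not a real NCBI record:
toy_mrna = "GGCACC" + "ATGGCTAAATAA" + "TTTGCA"  # 5' UTR + CDS + 3' UTR
toy_protein = "MAK"                              # ATG GCT AAA, then stop codon TAA
toy_cds = extract_cds(toy_mrna, 7, 18)

print(toy_cds)
print(matches_protein_length(toy_cds, toy_protein))
```

The full toy mRNA fails the length check while the sliced region passes it, which is the distinction I need to make for the real records.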
The ultimate aim is to extract the longest canonical transcript for each gene, an issue I've been trying to solve here. As you can see from that post, I have tried lots of different ways to get this to work, but so far with no luck.
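For the final selection step, a minimal sketch of what I mean, assuming the CDS records have already been retrieved and each one is labelled with its gene. It simply keeps the longest sequence per gene, which is only a proxy for the "canonical" transcript. The record layout `(gene, transcript_id, sequence)` and the toy data are hypothetical.

```python
def longest_per_gene(records):
    """Map each gene to its longest (transcript_id, sequence) pair.

    records: iterable of (gene, transcript_id, sequence) tuples.
    Ties are resolved in favour of the record seen first.
    """
    best = {}
    for gene, transcript_id, seq in records:
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (transcript_id, seq)
    return best


# Made-up toy records, not real NCBI data:
toy_records = [
    ("geneA", "tx1", "ATGAAATAA"),
    ("geneA", "tx2", "ATGAAACCCTAA"),
    ("geneB", "tx3", "ATGTAA"),
]
picked = longest_per_gene(toy_records)

for gene, (transcript_id, seq) in picked.items():
    print(gene, transcript_id, len(seq))
```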
If anyone could tell me specifically how to do this, ideally with a working command for these examples so that I can see how it works, I would appreciate it.