I understand that a similar question has already been asked (How To Know Which Is The Start Codon And Stop Codon For A Gene Sequence?), but I am completely new to bioinformatics and would really appreciate it if someone could spell it out for me. The GenBank entry for the gene I am interested in is shown here; however, it appears to list three different CDSs for this single gene. I am trying to construct a phylogeny of a paralog family, of which this gene is a member. Is there an accepted CDS of the three available that I should use when performing a codon alignment, or is there something else I am missing here? Is the CDS even what I should be looking at to construct the alignment for the phylogeny?
We face the same problem when doing our comparative genomics analysis in Ensembl. We try to pick a representative or canonical transcript using the following rules:
For human, the canonical transcript for a gene is set according to the following hierarchy:
1. Longest CCDS translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
So essentially, we use the longest one with the most evidence behind it. We don't claim that this is the "best" transcript in terms of biological significance, but it is the most useful for comparative genomics: a longer sequence gives more to work with, and more evidence means it is more likely to be real.
Some similar selection criteria should work for you.
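If it helps to see the hierarchy spelled out, here is a rough Python sketch of that kind of tiered selection. It is not Ensembl's actual code. The `Transcript` class, the `source` labels ("ccds", "merged", "other") and the `pick_canonical` function are all hypothetical names made up for illustration, and it assumes you have already parsed each CDS into a protein translation with the terminal stop removed, so that any remaining `*` marks an internal stop codon.

```python
from dataclasses import dataclass
from typing import List, Optional

STOP = "*"  # assumed symbol for a stop codon in a protein translation


@dataclass
class Transcript:
    name: str
    source: str  # hypothetical evidence label: "ccds", "merged" or "other"
    length: int  # transcript length in nucleotides
    translation: Optional[str] = None  # protein sequence; None if non-coding


def has_clean_translation(t: Transcript) -> bool:
    """True if the transcript is translated and contains no stop codons."""
    return t.translation is not None and STOP not in t.translation


def pick_canonical(transcripts: List[Transcript]) -> Optional[Transcript]:
    """Pick a representative transcript using a tiered, longest-first rule."""
    clean = [t for t in transcripts if has_clean_translation(t)]

    # Tiers 1-3: best-supported evidence class first, then any clean translation.
    tiers = [
        [t for t in clean if t.source == "ccds"],
        [t for t in clean if t.source == "merged"],
        clean,
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: len(t.translation))

    # Tier 4: no usable translation, so fall back to the longest
    # non-coding transcript.
    noncoding = [t for t in transcripts if t.translation is None]
    if noncoding:
        return max(noncoding, key=lambda t: t.length)

    # Assumption: if every translation contains a stop codon, pick nothing.
    return None


# Made-up example data, not taken from any real GenBank or Ensembl entry.
transcripts = [
    Transcript("tx_a", "other", 1500, "M" + "A" * 400),
    Transcript("tx_b", "ccds", 1200, "M" + "A" * 300),
    Transcript("tx_c", "ccds", 900, "M" + "A" * 250),
    Transcript("tx_d", "merged", 1600, "M" + "A" * 200 + STOP + "A" * 100),
]
canonical = pick_canonical(transcripts)
print(canonical.name)
```

Note that in this toy data `tx_a` has the longest translation overall, but the CCDS tier is checked first, so a shorter but better-supported transcript wins. For your GenBank entry you would substitute whatever evidence labels you trust for the three CDSs.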