Hey!
I'm trying to do some comparative genomics on different vertebrate proteins (cetaceans/whales more specifically). So I downloaded several RefSeq protein FASTA files for the proteins of interest in different cetacean species from NCBI. To check if there are any "weird" proteins (e.g. truncated proteins or ones with premature stop codons), I aligned them and made a phylogenetic tree, and nothing seemed too out of the ordinary.
However, recently I came across a manuscript showing that some of the genes I'm interested in are pseudogenised/lost. This seemed weird to me, as the protein alignments looked OK, so I looked into it in a bit more detail. I went to the gene page in NCBI, and found that in the gene annotation the following note is attached to the protein entry:
The sequence of the model RefSeq protein was modified relative to this genomic sequence to represent the inferred CDS: inserted 6 bases in 5 codons; deleted 2 bases in 2 codons; substituted 2 bases at 2 genomic stop codons
So the genomic sequence suggests that the gene is likely a pseudogene, and that the resulting protein should be either severely truncated (the first stop is in the first exon) or missing altogether (as it might not even be expressed). Despite this, the automatic RefSeq/NCBI-annotated sequences (XP_/XM_) are basically altered to "fit" better with known, confirmed proteins.
So I was wondering if other people have had similar experiences with RefSeq proteins, and how best to deal with them? Is it possible that some of the premature stop codons and/or frame-shifting indels are "ignored" during transcription/translation (in exons of course), so the "RefSeq" protein is still produced? Or is it more likely that the genomic sequence represents the "real truth", and I should mainly focus on that? Of course this depends on the genome assembly quality, but I still found this issue in very recent, high-quality, chromosome-level assemblies, so I don't think it's just assembly errors.
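In case it is useful for screening other records, here is a rough sketch of how the note quoted above could be parsed to flag "corrected" model proteins. It assumes the note has already been pulled out of the record as a plain string and that its wording matches the example above; the function names `parse_refseq_note` and `looks_corrected` are made up for illustration, and other phrasings of the note may exist that these patterns would miss.

```python
import re

# One pattern per kind of modification mentioned in the note.
# The wording is assumed from the single example quoted in this post.
NOTE_PATTERNS = {
    "inserted_bases": re.compile(r"inserted (\d+) bases? in (\d+) codons?"),
    "deleted_bases": re.compile(r"deleted (\d+) bases? in (\d+) codons?"),
    "substituted_stops": re.compile(
        r"substituted (\d+) bases? at (\d+) genomic stop codons?"
    ),
}


def parse_refseq_note(note):
    """Return the number of bases changed per modification type (0 if absent)."""
    counts = {}
    for key, pattern in NOTE_PATTERNS.items():
        match = pattern.search(note)
        counts[key] = int(match.group(1)) if match else 0
    return counts


def looks_corrected(note):
    """True if the note reports any insertion, deletion or stop substitution."""
    return any(parse_refseq_note(note).values())


note = (
    "The sequence of the model RefSeq protein was modified relative to this "
    "genomic sequence to represent the inferred CDS: inserted 6 bases in 5 "
    "codons; deleted 2 bases in 2 codons; substituted 2 bases at 2 genomic "
    "stop codons"
)
print(parse_refseq_note(note))
# {'inserted_bases': 6, 'deleted_bases': 2, 'substituted_stops': 2}
```

Proteins flagged this way could then be set aside and checked against the genomic sequence before going into the alignment.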