I have downloaded contig sequences from TSA (Transcriptome Shotgun Assembly) for the same species but from two different BioProjects (same authors, different studies). One file contains ~800,000 sequences, while the other has ~400,000 sequences.
I'm interested in identifying protein-coding regions and I'm using TransDecoder for that purpose. After running TransDecoder I obtained ~300,000 and ~150,000 protein-coding regions, respectively. I'm aware that TransDecoder looks for possible ORFs in all 6 reading frames, so the initial number of contig sequences is probably correlated with the final number of predicted proteins.
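To make the six-frame point concrete, here is a minimal Python sketch of a naive six-frame ORF scan. It is only an illustration of the idea, not TransDecoder's actual algorithm: the function names are hypothetical, it only reports complete ATG-to-stop ORFs (TransDecoder also reports partial ones), and the default `min_codons=100` is an assumption chosen to resemble a 100-amino-acid minimum length.

```python
# Naive six-frame ORF scan (illustrative only; not how TransDecoder is implemented).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for complete ATG..stop ORFs in one frame.

    `end` is exclusive and includes the stop codon. An ORF is kept only
    if it has at least `min_codons` codons before the stop.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=100):
    """Collect ORFs from the 3 forward and 3 reverse-complement frames.

    Coordinates for "-" hits are relative to the reverse-complemented contig.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end))
    return hits


if __name__ == "__main__":
    # Toy contig with one short ORF on each strand.
    contig = "ATGAAATTTTAGTTACCCGGGCAT"
    for hit in six_frame_orfs(contig, min_codons=3):
        print(hit)
```

Because every contig is scanned independently on both strands, a single contig can contribute more than one ORF, and redundant contigs (isoforms, fragments, assembly duplicates) each contribute their own, so the number of predictions grows with the number of input contigs.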
However, I'm wondering how one can infer the "true" (i.e. closest to reality) set of protein-coding regions for a species. For example, the proteome of Xenopus tropicalis currently contains 39,662 sequences (or mRNAs as stated here) and that of Anolis carolinensis 32,230. So why do I get so many proteins, and how can I get a more realistic number?