I am currently working on mammalian CDS data downloaded from Ensembl 83. To make sure I am working with "real" CDS sequences, I ran gene prediction with AUGUSTUS, using the CDS data as input and human as the training set. Surprisingly, for the few species I have run so far, I get fewer coding sequences than Ensembl 83 provides. For instance, for one genome I downloaded roughly 20,000 coding sequences from Ensembl, but AUGUSTUS predicted only 18,000 genes.
I am wondering whether there is a good way to validate CDS data from public databases other than using gene prediction tools. A difference of more than 4,000 genes does not make sense to me.
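To make concrete the kind of validation I have in mind, here is a minimal sketch of a structural check that does not rely on gene prediction. It flags sequences whose length is not a multiple of three, that do not start with ATG, that do not end in a stop codon, or that contain an in-frame internal stop. The function names (`read_fasta`, `cds_problems`) are my own, and the check assumes the standard genetic code, so mitochondrial genes and transcripts annotated as incomplete would be flagged even when the annotation is fine. I am not sure this is sufficient, which is partly why I am asking.

```python
from typing import Dict, Iterable, List

# Stop codons of the standard genetic code (assumption: nuclear genes only).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {first word of header: sequence}."""
    records: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks).upper()
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks).upper()
    return records


def cds_problems(seq: str) -> List[str]:
    """Return a list of structural problems found in one CDS (empty if none)."""
    problems: List[str] = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    # Complete codons read in frame 0; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems


if __name__ == "__main__":
    # Toy records for illustration only, not real Ensembl sequences.
    example = [">tx_ok", "ATGAAATAA", ">tx_internal_stop", "ATGTAAAAATGA"]
    for tx_id, seq in read_fasta(example).items():
        print(tx_id, cds_problems(seq) or "passes all checks")
```

For a real file, the lines could come from `open(path)` on the downloaded CDS FASTA, and counting how many records pass would give a number to compare against both the Ensembl total and the AUGUSTUS total.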