Question: What is the best way to validate coding sequence (CDS) data from public databases?
Hi,

I am currently working on mammalian CDS data downloaded from Ensembl 83. Since I wanted to be sure I am working with "real" CDS sequences, I ran gene prediction with AUGUSTUS, using the CDS data as input and human as the training set. Surprisingly, for the few species I have run so far, I get fewer coding sequences than Ensembl provides. For instance, for one genome I downloaded roughly 20,000 coding sequences from Ensembl, but running AUGUSTUS gave 18,000 genes.

I am wondering whether there is a good way to validate CDS data from public databases apart from using gene prediction tools. A difference of more than 4000 genes does not make sense to me as such.
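To make the question concrete, here is a minimal sketch of the kind of per-sequence sanity check that could be run without a gene predictor. It only tests basic structural properties of each sequence: length divisible by 3, an ATG start, a terminal stop codon, and no in-frame internal stops. The function name `check_cds` is hypothetical, the standard nuclear genetic code is assumed (mitochondrial genes use different stop codons), and a failed check does not prove a record is wrong, since some database entries may be deliberately annotated as incomplete.

```python
# Assumption: standard genetic code stop codons (not valid for mitochondrial genes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Return a list of problems found in one CDS; an empty list means it passed."""
    seq = seq.upper()
    problems = []

    # A complete CDS should consist of whole codons.
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")

    # A complete CDS normally begins with the ATG start codon.
    if not seq.startswith("ATG"):
        problems.append("no ATG start codon")

    # Split into in-frame codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    # The last full codon should be a stop codon.
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")

    # No stop codon should appear before the last codon.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")

    return problems
```

Applied to every record in a downloaded CDS FASTA file, a check like this would at least show how many sequences fail each criterion, which might help explain part of the gap between the database count and the predicted gene count.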


