What is the best way to validate coding sequence (CDS) data from public databases?


8.2 years ago

chevivien ▴ 90

Hi,

I am currently working on mammalian CDS data downloaded from Ensembl 83. Because I wanted to be sure I am working with "real" CDS sequences, I ran gene prediction with AUGUSTUS, using the CDS data as input and human as the training set. Surprisingly, for a few of the species I have run the prediction on, I am getting fewer coding sequences than Ensembl 83 provides. For instance, for one genome I downloaded roughly 20,000 coding sequences from Ensembl, but running AUGUSTUS gave me 18,000 genes.

I am wondering whether there is a good way to validate CDS data from public databases other than using gene prediction tools. A difference of more than 4000 genes does not make sense to me.
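To make concrete what I mean by validating without a gene predictor, here is a minimal sketch of a structural sanity check on a CDS FASTA file. It assumes the standard nuclear genetic code and that each record is meant to be a complete CDS that includes its terminal stop codon; both are assumptions, not something I have confirmed for every Ensembl file, and partial annotations would be flagged by it. The function names `read_fasta` and `check_cds` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Stop codons of the standard nuclear genetic code (assumption: no
# mitochondrial or other alternative code is involved).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            # Keep only the first whitespace-delimited token as the ID.
            header = line[1:].split(maxsplit=1)[0] if len(line) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()


def check_cds(seq: str) -> List[str]:
    """Return a list of structural problems found in one CDS (empty if none)."""
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems
```

It could be run over a downloaded file with something like `for rec_id, seq in read_fasta(open("cds.fa")): print(rec_id, check_cds(seq))`, where `cds.fa` is a placeholder file name. Counting how many records fall into each problem class would at least show whether the extra sequences are structurally suspicious, although a sequence passing these checks is not proof that it is a real gene.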

Augustus CDS-prediction Ensembl • 1.8k views

updated 21 months ago by Ram 43k • written 8.2 years ago by chevivien ▴ 90
