I am new to the topic of genome annotation and would like to get some advice regarding my planned strategy.
I am exploring genetic diversity in maize, and for this purpose I have assembled de novo the genome of a variant which is supposed to be quite divergent from the reference. I'd now like to annotate the assembly. The strategy I have thought about is as follows:
Since maize has quite a lot of genomic resources, I don't see a reason to go for ab initio annotation as a first attempt. Rather, I would like to base my annotation on the existing annotation of the maize reference (B73) and a collection of transcripts which I have gathered from previous publications. Unfortunately, I do not have RNA-Seq data from the individual I am annotating. I combined multiple transcript sets with the official annotation and aligned them to my assembled sequence. However, the result seems rather noisy, with ~150k predicted genes, which sounds like too many. I am wondering what my next step should be. Maybe I should filter my transcript set to reduce noise? I have already tried filtering out very short transcripts, but can I do something more sophisticated? Another option I have thought of is putting the transcripts through a clustering algorithm (e.g. OrthoMCL), then taking representative transcripts from each cluster and maybe removing singletons.
Is there a common way to assess the integrity of a specific gene annotation? Maybe this could help me remove pseudogenes and/or spurious alignments from my annotation results.
Has anyone here done something like this before? I would appreciate any thoughts or advice on the strategy I described here.
PASA has a pipeline where you can use existing information to improve the annotations. GeMoMa also provides a workflow that uses annotations from multiple references (not entirely transcript based) to annotate a target genome. The ~150k predicted genes are probably because each isoform is counted as a separate gene; you would have to collapse overlapping transcripts into shared gene boundaries. Also, many of the predicted annotations may be pseudogenes or lncRNAs, but separating these from genes is another problem altogether :|
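To get a rough idea of how much of the ~150k is just isoform redundancy, you could collapse the aligned transcripts into loci and count those instead. Below is a minimal Python sketch, not what PASA or GeMoMa do internally. It assumes you have already pulled (seqid, strand, start, end, transcript_id) tuples out of your alignment output; the function name `collapse_to_loci` and the example coordinates are made up for illustration. It merges transcripts whose spans overlap on the same sequence and strand, so it will also fuse neighbouring or nested genes with overlapping spans; treat the count as an estimate only.

```python
from collections import defaultdict


def collapse_to_loci(transcripts):
    """Merge overlapping transcript spans into loci.

    transcripts: iterable of (seqid, strand, start, end, transcript_id),
    with 1-based inclusive coordinates (an assumption of this sketch).
    Returns a list of (seqid, strand, start, end, [transcript_ids]).
    """
    # Group spans per sequence and strand so opposite strands never merge.
    by_key = defaultdict(list)
    for seqid, strand, start, end, tid in transcripts:
        by_key[(seqid, strand)].append((start, end, tid))

    loci = []
    for (seqid, strand), spans in sorted(by_key.items()):
        spans.sort()  # sort by start, then end
        cur_start, cur_end, cur_ids = None, None, []
        for start, end, tid in spans:
            if cur_ids and start <= cur_end:
                # Overlaps the current locus: extend it.
                cur_end = max(cur_end, end)
                cur_ids.append(tid)
            else:
                # No overlap: close the previous locus and start a new one.
                if cur_ids:
                    loci.append((seqid, strand, cur_start, cur_end, cur_ids))
                cur_start, cur_end, cur_ids = start, end, [tid]
        if cur_ids:
            loci.append((seqid, strand, cur_start, cur_end, cur_ids))
    return loci


# Hypothetical example: t1 and t2 overlap on the + strand, t3 is on the
# opposite strand, t4 is far downstream.
example = [
    ("chr1", "+", 100, 500, "t1"),
    ("chr1", "+", 300, 900, "t2"),
    ("chr1", "-", 400, 800, "t3"),
    ("chr1", "+", 2000, 2500, "t4"),
]
loci = collapse_to_loci(example)
print(len(example), "transcripts collapse into", len(loci), "loci")
```

If the number of loci is much closer to the gene count of the B73 annotation, isoforms were the main cause; if it is still far too high, the remainder is more likely spurious alignments, pseudogenes or lncRNAs.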