Question: Strategy for genome annotation for a new individual of a well-established species
2.9 years ago by
liorglic340 wrote:

I am new to the topic of genome annotation and would like to get some advice regarding my planned strategy.
So, I am exploring genetic diversity in maize, and for this purpose I have de novo assembled the genome of a variant which is supposed to be quite divergent from the reference. I'd now like to annotate the assembly. The strategy I have in mind is as follows:
Since maize has quite a lot of genomic resources, I don't see a reason to go for ab initio annotation as a first attempt. Rather, I would like to base my annotation on the existing annotation of the maize reference (B73) and a collection of transcripts which I have gathered from previous publications. Unfortunately, I do not have RNA-Seq data from the individual I am annotating.
I combined multiple transcript sets and the official annotation and aligned them to my assembled sequence. However, the result seems rather noisy, with ~150k predicted genes, which sounds like too many. I am wondering what my next step should be. Maybe I should filter my transcript set to reduce noise? I have already tried filtering out very short transcripts, but can I do something more sophisticated? Another option I have thought of is putting the transcripts through a clustering algorithm (e.g. OrthoMCL), then taking representative transcripts from each cluster and maybe removing singletons.
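As a point of reference for the filtering step described above, here is a minimal sketch of a first-pass cleanup when several transcript sets are pooled: drop very short sequences and exact duplicate sequences. The function names and the `min_len` threshold are illustrative assumptions, not values from this thread, and exact-duplicate removal is a much weaker step than proper clustering.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the record ID.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_transcripts(records: Iterable[Tuple[str, str]],
                       min_len: int = 300) -> Dict[str, str]:
    """Keep transcripts of at least min_len bases, one per distinct sequence.

    min_len=300 is an arbitrary placeholder; choose it for your own data.
    When two records share an identical sequence, the first ID seen is kept.
    """
    kept: Dict[str, str] = {}
    seen = set()
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_len or seq in seen:
            continue
        seen.add(seq)
        kept[name] = seq
    return kept


# Tiny in-memory example; real input would come from open("transcripts.fa").
demo = [">t1 some description", "ATGC", "ATGC", ">t2", "atgcatgc", ">t3", "AT"]
demo_kept = filter_transcripts(read_fasta(demo), min_len=4)
```

In the demo, `t2` is dropped as a case-insensitive duplicate of `t1` and `t3` is dropped as too short. Near-identical transcripts (e.g. differing by a few bases between publications) are not caught by this and would need a clustering tool.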
Is there a common way to assess the integrity of a specific gene annotation? Maybe this could help me remove pseudogenes and/or random alignments from my annotation results?
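One simple heuristic for the integrity question above is to check whether the coding sequence implied by each annotated gene model is still an intact open reading frame in the new assembly. The sketch below assumes the spliced CDS has already been extracted on the coding strand and that the standard genetic code applies; the function name is hypothetical. A failed check only flags a candidate pseudogene or misalignment, it does not prove one.

```python
from typing import List

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> List[str]:
    """Return the reasons a CDS looks broken; an empty list means it looks intact."""
    cds = cds.upper()
    problems: List[str] = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature in-frame stop codon")
    return problems


intact = cds_problems("ATGAAATAA")
broken = cds_problems("ATGTAAAAATAA")
```

Gene models that fail only because a transcript alignment is truncated at a contig edge will also be flagged, so it is worth looking at where the failures fall before discarding anything.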
Has anyone here done something like this before? I would appreciate any thoughts or advice on the strategy I described here.
Thank you!


You can take a look at RATT which is now part of PAGIT.

written 2.9 years ago by GenoMax

Thanks, I wasn't familiar with this software. However, since I'm also interested in detecting new genes not present in the reference annotation, I'd like to use the transcriptomic data on top of that, and this is where most of the noise comes from.

written 2.9 years ago by liorglic
2.9 years ago by
Rohit1.4k wrote:

PASA has a pipeline in which you can use existing information to improve the annotations. GeMoMa also provides a workflow that uses annotations from multiple references (not entirely transcript-based) to annotate a target genome. The ~150k predicted genes are probably due to isoforms being reported separately, in which case you would have to collapse them into gene boundaries. Also, many of the predicted annotations may be pseudogenes or lncRNAs, but separating these from genes is another problem altogether :|
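To get a rough idea of how much of the ~150k count comes from isoforms, aligned transcripts can be collapsed into loci by merging overlapping spans on the same sequence and strand. This is only a sketch under stated assumptions: the input tuple layout and the function name are made up for illustration, coordinates are taken as 1-based inclusive, and span overlap is a cruder criterion than the exon-level overlap that dedicated tools typically use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Transcript = Tuple[str, str, int, int, str]      # (seqid, strand, start, end, transcript_id)
Locus = Tuple[str, str, int, int, List[str]]     # (seqid, strand, start, end, member_ids)


def collapse_to_loci(transcripts: Iterable[Transcript]) -> List[Locus]:
    """Merge transcripts whose spans overlap on the same sequence and strand."""
    by_key: Dict[Tuple[str, str], List[Tuple[int, int, str]]] = defaultdict(list)
    for seqid, strand, start, end, tid in transcripts:
        by_key[(seqid, strand)].append((start, end, tid))

    loci: List[Locus] = []
    for (seqid, strand), items in sorted(by_key.items()):
        items.sort()  # sort by start coordinate, then end
        cur_start, cur_end, members = 0, 0, []
        for start, end, tid in items:
            if members and start <= cur_end:
                # Overlaps the current locus: extend it.
                cur_end = max(cur_end, end)
                members.append(tid)
            else:
                # No overlap: close the current locus and start a new one.
                if members:
                    loci.append((seqid, strand, cur_start, cur_end, members))
                cur_start, cur_end, members = start, end, [tid]
        if members:
            loci.append((seqid, strand, cur_start, cur_end, members))
    return loci


example = [
    ("chr1", "+", 100, 500, "a"),
    ("chr1", "+", 400, 900, "b"),
    ("chr1", "+", 1000, 1200, "c"),
    ("chr1", "-", 450, 600, "d"),
]
example_loci = collapse_to_loci(example)
```

Here four transcripts collapse into three loci, since `a` and `b` overlap on the plus strand. Note that neighboring genes with overlapping UTRs would be merged by this approach, so the locus count is a lower-bound style estimate rather than a final gene count.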
