Question

Strategy for genome annotation for a new individual of a well-established species

1

6.1 years ago

liorglic ★ 1.4k

Hi,
I am new to the topic of genome annotation and would like to get some advice regarding my planned strategy.
So, I am exploring genetic diversity in maize, and for this purpose I have de novo assembled the genome of a variant which is supposed to be quite divergent from the reference. I'd now like to annotate the assembly. The strategy I have thought about is as follows:
Since maize has quite a lot of genomic resources, I don't see a reason to go for ab initio annotation as a first attempt. Rather, I would like to base my annotation on the existing annotation of the maize reference (B73) and a collection of transcripts which I have gathered from previous publications. Unfortunately, I do not have RNA-Seq data from the individual I am annotating. I combined multiple transcript sets and the official annotation and aligned them to my assembled sequence. However, the result seems rather noisy, with ~150k predicted genes, which sounds like too many. I am wondering what my next step should be. Maybe I should filter my transcript set to reduce noise? I have already tried filtering out very short transcripts, but can I do something more sophisticated? Another option I have thought of is putting the transcripts through a clustering algorithm (e.g. OrthoMCL), then taking a representative transcript from each cluster and maybe removing singletons.
Is there a common way to assess the integrity of a specific gene annotation? Maybe this could help me remove pseudogenes and/or random alignments from my annotation results?
Has anyone here done something like this before? I would appreciate any thoughts or advice on the strategy I described here.
Thank you!

Assembly annotation transcriptome • 1.3k views

ADD COMMENT • link updated 6.1 years ago by Rohit ★ 1.5k • written 6.1 years ago by liorglic ★ 1.4k

0

You can take a look at RATT which is now part of PAGIT.

ADD REPLY • link 6.1 years ago by GenoMax 141k

0

Thanks, I wasn't familiar with this software. However, since I'm also interested in detecting new genes not present in the reference annotation, I'd like to use the transcriptomic data on top of that, and this is where most of the noise comes from.

ADD REPLY • link 6.1 years ago by liorglic ★ 1.4k

score 0 · Answer 1 · 2018-03-04

PASA has a pipeline where you can use existing information to improve the annotations. GeMoMa also provides a workflow that uses annotations from multiple references (not entirely transcript-based) to annotate a target genome. The ~150k predicted genes are probably because your isoforms are listed separately, in which case you would have to collapse them into shared gene boundaries. Also, many of the predicted annotations could be pseudogenes or lncRNAs, but separating these from genes is another problem altogether :|
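
To make the point about collapsing isoforms concrete, here is a minimal sketch in Python. It is not what PASA or GeMoMa do internally; it only groups aligned transcripts whose genomic spans overlap on the same chromosome and strand, and counts each group as one locus. The `Transcript` tuple, the `collapse_to_loci` function and the example coordinates are all made up for illustration, and coordinates are assumed to be 1-based and inclusive, as in GFF.

```python
from collections import namedtuple

# Hypothetical record for one aligned transcript (1-based, inclusive coordinates).
Transcript = namedtuple("Transcript", "tid chrom strand start end")


def collapse_to_loci(transcripts):
    """Group transcripts whose spans overlap on the same chromosome and strand."""
    loci = []
    # Sorting by chromosome, strand and start lets a single pass merge overlaps.
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start, t.end))
    for t in ordered:
        last = loci[-1] if loci else None
        same_place = (
            last is not None
            and last["chrom"] == t.chrom
            and last["strand"] == t.strand
            and t.start <= last["end"]
        )
        if same_place:
            # Overlaps the current locus: extend it and record the isoform.
            last["end"] = max(last["end"], t.end)
            last["members"].append(t.tid)
        else:
            # No overlap: start a new locus.
            loci.append({
                "chrom": t.chrom,
                "strand": t.strand,
                "start": t.start,
                "end": t.end,
                "members": [t.tid],
            })
    return loci


# Made-up example: tx1 and tx2 overlap on the + strand, tx3 is on the - strand.
example = [
    Transcript("tx1", "chr1", "+", 100, 500),
    Transcript("tx2", "chr1", "+", 400, 900),
    Transcript("tx3", "chr1", "-", 450, 800),
    Transcript("tx4", "chr1", "+", 2000, 2600),
]
loci = collapse_to_loci(example)
print(len(example), "transcripts ->", len(loci), "loci")
```

Note that overlap of whole spans is a crude criterion: a gene nested inside another gene's intron would be merged into the same locus. Requiring exon-level overlap is stricter, so treat the count from a sketch like this as a rough estimate only.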