Genome Annotation Strategies
9.3 years ago

I would like to discuss the strategies you use for whole-genome annotation. I am not restricting the question to the organisms I currently work on, but I have been working on herpesviruses, which have fairly small genes (from 100 bp to 10,000 bp), only a few of which contain introns. Fortunately, unlike in small RNA viruses, few genes overlap in herpesviruses (dsDNA).

I have been using a home-made procedure based on CDD domain detection in all six reading frames, which I find quite effective. I have also been looking at annotation software such as Augustus and GeneMark.
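For readers unfamiliar with six-frame scanning, here is a minimal Python sketch of the first step only: translating a sequence in all six frames and collecting stop-to-stop open stretches. It is not the poster's actual procedure, which is not described in the thread. The function names and the `min_aa` threshold are made up for illustration, and the standard genetic code (NCBI table 1) is assumed. The resulting peptides could then be searched against CDD, for example with RPS-BLAST, which is not shown here.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; ambiguous codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_aa=30):
    """Yield (strand, frame, start, end, peptide) for each open stretch.

    Coordinates are 0-based, half-open, and always given on the forward
    strand. Stretches run from stop to stop (or sequence end); no start
    codon is required. min_aa is a hypothetical cutoff, not a recommendation.
    """
    seq = seq.upper()
    n = len(seq)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            pos = 0  # amino-acid offset of the current segment in this frame
            for segment in translate(s[frame:]).split("*"):
                if len(segment) >= min_aa:
                    start = frame + 3 * pos
                    end = start + 3 * len(segment)
                    if strand == "-":
                        # Map reverse-strand coordinates back to the forward strand.
                        start, end = n - end, n - start
                    yield strand, frame, start, end, segment
                pos += len(segment) + 1  # +1 skips the stop codon
```

For very small ORFs (around 100 bp, roughly 33 codons), `min_aa` has to be set low, which in turn increases the number of spurious stretches that the domain search must filter out.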

What is your experience in genome annotation and what are the tools/strategies you would favor?

Thanks in advance for the discussion.

Our lab developed MAKER and MAKER2. I don't know whether MAKER has been used to annotate viruses. What challenges exist in viral annotation besides overlapping ORFs?

Here are some of the challenges in this field: some very small ORFs (100 bp) and some very specific genes with no homologues.

These days I am working with Augustus. If you have some information, such as the gene structure of a closely related species, RNA-seq/EST data, or protein sequences, Augustus will do fine, although it is a little outdated.

In my current project, which is on a non-model organism, I have none of these...

From your experience, do you have suggestions for ab initio annotation algorithms?

Hello group,

I have predicted gene sets for a large eukaryotic genome using an ab initio approach (GlimmerHMM) and evidence-based approaches (Exonerate and BLASTX), and now I want to build a consensus gene set from both sets of predictions. Initially I had trouble loading the GFF files into the MySQL database because of permission issues, but I fixed that and loaded them into a Bio::DB::GFF-compatible MySQL database using bp_load_gff.pl, since GLEAN supports Bio::DB::GFF GFF2 files. I then tried to run the software with the following command, but I get an error.

./glean-lca --param trial.yaml --database dbname --user root --password root123 > output.dat


GFF2 files loaded to Bio::DB::GFF using bp_load_gff.pl

./glean-lca --param ../data/trial.yaml --database dbname --user root --password root123 > new2.dat
UNIVERSAL->import is deprecated and will be removed in a future perl at /data/data/myp/glean-gene/bin/../lib/Glean/MLE.pm line 9.
No reference provided; attempting to analyze entire genome
Estimating parameters for donor sites ...
Can't use an undefined value as an ARRAY reference at /data/data/myp/glean-gene/bin/../lib/Glean/MLE.pm line 265.


I don't know what went wrong. Where am I making a mistake?

You should post this as a new question. Your post is not an answer, and you won't get one this way either.

9.3 years ago
Lhl ▴ 730

It would be good if you could combine de novo gene prediction with external evidence, such as similarity searches against the proteomes of close relatives, full-length cDNAs, RNA-seq and even ESTs.

There are some combiners (e.g. GLEAN, EVM, MAKER ...) to merge different predictions.
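To give a rough feel for what "merging predictions" means, here is a toy Python sketch that keeps ab initio gene models only when an evidence-based model overlaps them on the same sequence and strand. This is deliberately naive and is not how GLEAN, EVM or MAKER work; those tools weigh evidence at the exon and splice-site level. The `Gene` type, the function names and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive, as in GFF
    end: int
    strand: str


def reciprocal_overlap(a, b):
    """Shared length divided by the longer of the two models (0.0 to 1.0)."""
    if a.seqid != b.seqid or a.strand != b.strand:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    longer = max(a.end - a.start + 1, b.end - b.start + 1)
    return shared / longer


def supported(ab_initio, evidence, min_overlap=0.5):
    """Return ab initio models backed by at least one evidence model.

    min_overlap is a hypothetical cutoff chosen only for this example.
    """
    return [
        g for g in ab_initio
        if any(reciprocal_overlap(g, e) >= min_overlap for e in evidence)
    ]
```

A filter like this discards unsupported ab initio models entirely, so it would also drop genuine genes that simply lack homologues, which is exactly the hard case raised in the question.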

I do not know any good solution for predicting short genes.

Unfortunately, as this is a non-model organism, I don't have data from close relatives. But thanks for the suggestion and for the combiners you recommend; I'll look into them.