I'm working on GAWN (https://github.com/enormandeau/gawn), a genome annotation pipeline that produces fast results based on evidence from an available transcriptome.
Popular genome annotation pipelines are usually difficult to install and use, and they take forever to run. That is, when they do not break on random scaffolds. Over the last few years, I have been looking hard for an easy way to annotate newly assembled genomes, even one that would only provide a "good-enough" annotation without ab initio gene prediction.
Last week, a colleague (see acknowledgement at the top of the README.md file in the GitHub repository) suggested we use the splice-aware GMAP aligner to generate basic gene annotations in GFF3 format. This worked beautifully. He then added that we could use cufflinks and TransDecoder to add UTR regions. I then added Swissprot-based annotation of the transcripts, which I propagate to the annotated genes on the genome.
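For readers curious what the propagation step could look like, here is a minimal Python sketch. It is not the actual GAWN code. It assumes that each gene or mRNA line in the GFF3 carries the source transcript ID in a `Name` attribute, and that the transcript annotations are already available as a dictionary. The function name `propagate_annotations` and the use of a `Note` attribute are choices made for this illustration only.

```python
import re
from urllib.parse import quote


def propagate_annotations(gff3_lines, transcript_annotations,
                          feature_types=("gene", "mRNA")):
    """Append a Note attribute holding the transcript's annotation.

    Assumption: the transcript ID is stored in the Name attribute
    of the feature. Lines that do not match are returned unchanged.
    """
    annotated = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        # Keep blank lines, directives and comments as they are.
        if not line or line.startswith("#"):
            annotated.append(line)
            continue
        fields = line.split("\t")
        # A GFF3 feature line has exactly 9 tab-separated columns.
        if len(fields) != 9 or fields[2] not in feature_types:
            annotated.append(line)
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", fields[8])
        description = transcript_annotations.get(match.group(1)) if match else None
        if description:
            # Percent-encode characters that are reserved in GFF3 attributes.
            note = quote(description, safe=" ")
            fields[8] = fields[8].rstrip(";") + ";Note=" + note
        annotated.append("\t".join(fields))
    return annotated


example = [
    "##gff-version 3",
    "chr1\tgenome\tgene\t100\t900\t.\t+\t.\tID=tx1.path1;Name=tx1",
    "chr1\tgenome\texon\t100\t300\t.\t+\t.\tID=tx1.exon1;Name=tx1",
]
for out_line in propagate_annotations(example, {"tx1": "Heat shock protein 70"}):
    print(out_line)
```

Only the feature types listed in `feature_types` are touched, so exon and CDS lines pass through untouched.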
Using GAWN, I annotated 3 eukaryote genomes over the last few days (multiple times each, as I was still developing the pipeline). Depending on the genome size, annotation takes between 30 minutes and a few hours. This is orders of magnitude faster than more standard approaches and tools. The GMAP aligner does, however, require a good amount of RAM to index genomes: it took 61 GB of RAM to index a 3.7 Gb genome assembly, which is the main drawback. The other limitation is of course the need for a transcriptome from the same species or, alternatively, from a closely related one.
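To get a rough idea of whether a given machine can index a genome, the single data point above can be turned into a back-of-the-envelope estimate. The sketch below assumes that indexing memory grows linearly with assembly size. That is an assumption made here for illustration, not something measured, and the function name `estimate_index_ram_gb` is hypothetical.

```python
# Figures reported above: 61 GB of RAM to index a 3.7 Gb assembly.
OBSERVED_RAM_GB = 61.0
OBSERVED_GENOME_GBASES = 3.7


def estimate_index_ram_gb(genome_size_gbases):
    """Linearly extrapolate indexing RAM from one observation.

    Assumption: memory scales proportionally with genome size,
    which has not been verified.
    """
    if genome_size_gbases <= 0:
        raise ValueError("genome size must be positive")
    ram_per_gbase = OBSERVED_RAM_GB / OBSERVED_GENOME_GBASES
    return ram_per_gbase * genome_size_gbases


for size in (0.5, 1.0, 3.7):
    estimate = estimate_index_ram_gb(size)
    print(f"{size:.1f} Gb genome: roughly {estimate:.0f} GB of RAM")
```

Treat the numbers as an order-of-magnitude guide only, since they rest on a single measurement.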
The output files are:
- A genome annotation GFF3 file
- A transcriptome annotation table
- A genome annotation table
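A quick sanity check on the GFF3 output is to count how many features of each type it contains. Here is a small sketch that relies only on the standard 9-column GFF3 layout. The function name `count_feature_types` is illustrative, and the example lines are made up.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 features by type (column 3), skipping comments."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Ignore malformed lines that do not have 9 columns.
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


example = [
    "##gff-version 3",
    "chr1\tgenome\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tgenome\tmRNA\t100\t900\t.\t+\t.\tID=g1.m1;Parent=g1",
    "chr1\tgenome\texon\t100\t300\t.\t+\t.\tParent=g1.m1",
    "chr1\tgenome\texon\t500\t900\t.\t+\t.\tParent=g1.m1",
]
for feature_type, count in sorted(count_feature_types(example).items()):
    print(feature_type, count)
```

To run it on a real file, pass an open file handle instead of the example list, since the function only iterates over lines.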
Obviously, I still need to test how the annotation I am getting compares to existing annotations, especially for the UTRs. However, since the annotation is based on the gene and exon positions of existing transcripts, which are themselves annotated using Swissprot, I am fairly confident in the approach. I am thinking of trying it on the human and Danio (zebrafish) genomes.
I am fairly excited about finally being able to annotate genomes rapidly and without nightmares.
Version v0.2, which is currently available, is fully functional. It has been tested on Linux only. It should work on OSX as long as you have the dependencies installed.
Here are the dependencies. The version numbers are the ones that have been tested. It is suggested that you use these or more recent versions, although the pipeline will probably work just fine with some older versions.
- GNU Linux or OSX
- bash 4+
- python 2.7+ (TODO or 3.5+)
- cufflinks v2.2.1+
- wget 1.17.1
- gnu parallel 2017xxxx+
- blastplus utilities (blastx) 2.3.0+
- a local copy of the swissprot database
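Before a first run, it can help to confirm that the command-line dependencies are reachable. The sketch below uses Python's `shutil.which` and assumes the tools are installed under their usual executable names (`cufflinks`, `wget`, `parallel`, `blastx`, and so on). It does not check version numbers or the presence of the Swissprot database, and the function name `find_missing` is hypothetical.

```python
import shutil

# Assumed executable names for the dependencies listed above.
REQUIRED_EXECUTABLES = ("bash", "python", "cufflinks", "wget", "parallel", "blastx")


def find_missing(executables=REQUIRED_EXECUTABLES, which=shutil.which):
    """Return the executables that cannot be found on the PATH."""
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed executables were found on the PATH.")
```

The `which` parameter exists only so that the lookup can be swapped out for testing.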
I'd be glad to have your opinion on the approach, implementation, and documentation, as well as any bug reports or suggestions.
Please chime in!