Best strategy for gene annotation for de novo genome assembly without RNA-seq data
2.2 years ago
soniabedi.07 ▴ 10

I would like to know the best method or pipeline for gene annotation of a draft de novo genome assembly. Using BLAST is one way to go. Is there any tool (with good accuracy) that can annotate the genome? Thank you in advance.


Would you mind adding the species you are working on, please? Thank you.


It is a plant genome of roughly 2-2.5 Gb.

2.2 years ago
jean.elbers ★ 1.6k

I am assuming this is a eukaryotic species. Depending on the specific species, you could use protein evidence from related species as input, along with a BUSCO-trained custom Augustus species and MAKER (http://www.yandell-lab.org/software/maker.html), to predict protein-coding genes. If there is a related, well-annotated species (ideally from ENSEMBL), you could use the Comparative Annotation Toolkit (CAT, https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit). You could also run MAKER first, then align the predicted cDNA transcripts to your de novo genome and use those alignments as input for Augustus PB in the CAT workflow.

edit: changed predicted proteins to cDNA transcripts


Hi. Thank you for your input, it really sounds helpful. Could you link to any paper that I could read to understand the above in detail?


I don't know of any publications that describe this method specifically. Below is an unpublished methods section with the species concealed as "new rodent genome".

We first annotated new rodent genome scaffolds greater than 10 Kbp with MAKER v. 2.31.10 (Cantarel et al., 2008; Holt & Yandell, 2011). For the single MAKER run, we masked repetitive regions with RepeatMasker v. open-4.0.7 (http://www.repeatmasker.org) against the entire Dfam_Consensus release 20170127 database and used a new rodent genome specific repeat library created with RepeatModeler v. open-1.0.10 (http://www.repeatmasker.org) with the new rodent genome assembly as input. We filtered the repeat library from RepeatModeler to remove known UniProt/SwissProt v. 2019_01 (Boutet et al., 2016) proteins using ProtExcluder v. 1.1 (Campbell et al., 2014).
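The scaffold length filter mentioned above (keeping scaffolds greater than 10 Kbp) could be sketched roughly as follows. This is only an illustration, not the script used for the methods section; the function names `read_fasta` and `filter_by_length` are hypothetical, and the sketch assumes a plain, uncompressed FASTA input.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=10_000):
    """Keep records strictly longer than min_len bases (assumed reading of "greater than 10 Kbp")."""
    for header, seq in records:
        if len(seq) > min_len:
            yield header, seq


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python filter_scaffolds.py assembly.fasta > filtered.fasta
    with open(sys.argv[1]) as handle:
        for header, seq in filter_by_length(read_fasta(handle)):
            print(f">{header}\n{seq}")
```

For a 2-2.5 Gb assembly, an existing tool would likely be faster than pure Python; the sketch is only meant to make the filtering rule explicit.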

For the MAKER run, we included ab initio gene predictions from Augustus v. 3.3.2 (Stanke et al., 2006) trained with BUSCO v. 3.0.2 (Simão et al., 2015) using Eukaryota OrthoDB v. 9.1 genes (Zdobnov et al., 2017) and ab initio gene predictions from GeneMark-ES v. 4.38 (Lomsadze, 2005). We also included predicted proteins from Mus musculus and Rattus norvegicus (GenBank accessions [NCBI annotation release]: GCF_000001635.26 [106] and GCF_000001895.5 [106], respectively).

After the MAKER run finished, we only retained genes, transcripts, and proteins with annotation edit distance (AED) ≤ 0.50. We predicted putative gene function with blastp v. 2.2.31+ (Altschul, 1990) searches against the UniProt/SwissProt v. 2019_01 database using an E value cutoff of 1e-6 and assigned protein domains and gene ontology terms using InterProScan v. 5.32.71.0 (Jones et al., 2014).
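As a rough sketch of the AED ≤ 0.50 filter described above: MAKER writes an `_AED` attribute on mRNA features in its GFF3 output, so one way to select transcripts is to parse column 9 and compare that value to the cutoff. The helper names below are hypothetical, and this is not the script used for the methods section; it only collects the IDs of passing transcripts, which would then be used to subset the GFF3 and FASTA files.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_passing_aed(gff_lines, max_aed=0.50):
    """Return the set of mRNA IDs whose _AED attribute is <= max_aed."""
    passing = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: features without an _AED or ID attribute are skipped.
        if "_AED" in attrs and "ID" in attrs and float(attrs["_AED"]) <= max_aed:
            passing.add(attrs["ID"])
    return passing
```

Filtering at the transcript level means a gene should be kept if at least one of its transcripts passes; that policy is a design choice of this sketch, not something stated in the methods.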

After annotating the new rodent genome with MAKER, we transferred annotations from ENSEMBL 95’s annotation of GRCm38 (Genome Reference Consortium M. musculus genome assembly version 38) to the new rodent genome with the Comparative Annotation Toolkit (CAT, Fiddes et al., 2018). Briefly, we repeat-masked the new rodent genome and GRCm38 with RepeatMasker v. open-4.0.8 (http://www.repeatmasker.org) against the mammal repeats from RepBase RepeatMaskerEdition-20181026 (Jurka et al., 2005). We then used default settings in Progressive Cactus (Paten et al., 2011; Paten et al., 2011) to generate a HAL (hierarchical alignment format) alignment between GRCm38 and the new rodent genome. For running CAT, we used Augustus with the setting “--augustus-utr-off” and the same Augustus species as that used during the MAKER run. When running CAT, we also used Augustus PB: we gave the bases of the MAKER-predicted cDNA transcripts a fake Phred quality score of Q40 with SeqTK v. 1.2-r102-dirty (https://github.com/lh3/seqtk) and aligned these synthetic long reads to the new rodent genome using Minimap2 v. 2.16 (Li, 2018) with the “-ax splice -uf -C5” settings.
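To illustrate what "a fake Phred quality score of Q40" means in practice: in the standard Phred+33 FASTQ encoding, Q40 is the character `I` (ASCII 40 + 33 = 73), so each transcript gets a quality string of `I` repeated once per base. The methods above used SeqTK for this step; the pure-Python sketch below, with hypothetical function names, only shows the conversion.

```python
def phred_char(quality, offset=33):
    """Return the FASTQ character for a Phred score (Phred+33 encoding by default)."""
    return chr(quality + offset)


def fasta_record_to_fastq(header, seq, quality=40):
    """Format one sequence as a FASTQ record with a constant, fake quality."""
    qual = phred_char(quality) * len(seq)
    return f"@{header}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    # Hypothetical transcript name and sequence, for illustration only.
    print(fasta_record_to_fastq("transcript-1", "ATGGCCTAA"), end="")
```

The resulting FASTQ records are what would be passed to the aligner as synthetic long reads.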


Thank you. This is of great help.


@soniabedi.07 I might be able to offer assistance with this annotation. I would need a link to the genome (a Dropbox link would be great) and, obviously, the species it belongs to. If you are interested, you can email me these details by clicking on the email link on my profile at https://www.vetmeduni.ac.at/de/fiwi/beruns/personen/


Hmmm...I didn't realize this was a plant genome. I don't have direct experience with MAKER (really should be MAKER-P for plants) and CAT with plants.

2.2 years ago
Juke34 ★ 5.8k

If you are not experienced in genome annotation and you want a good result with minimal effort, I would recommend the first approach from jean.elbers (MAKER with protein evidence and Augustus trained by BUSCO) or BRAKER2. For BRAKER2, you first need to map a set of proteins (e.g. Swiss-Prot) to your genome and provide the GFF to BRAKER2, which will do the rest. The Comparative Annotation Toolkit is excellent but not necessarily easy to run...


Okay, I will try this. But I am rather curious to try the Comparative Annotation Toolkit too.


Which will be the faster process (I have limited time): annotation by MAKER-P, BRAKER2, or CAT?

Also, what makes CAT not easy to run? I am just trying to understand more about it. Thank you for your valuable input.


No need to specify MAKER-P: this flavor no longer exists by itself, as all its improvements were included in MAKER several years ago. Compute time depends on many parameters. The longest step for CAT is the whole-genome alignment, so it will depend on how many closely related species (< ~50 My divergence) you can or want to use.
For MAKER, it will depend on how many lines of evidence (size of the protein set, how many transcriptomes, ESTs, etc.) you want to align, and also on the size of your cluster. Using MPI, you can scale MAKER to annotate a 20 Gb genome within 24 hours, but you will need a huge number of available compute nodes... For a ~1 Gb genome with a fair amount of evidence, it takes ~1 week on ~80 CPUs. BRAKER2 will be fast in comparison because it does not align evidence over the genome. It is ab initio... But I think it is just multithreaded, not parallelized over multiple compute nodes...
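As a back-of-the-envelope illustration of the MAKER figure above (~1 week on ~80 CPUs for a ~1 Gb genome), the arithmetic can be written out as below. The linear scaling with genome size and CPU count is purely an assumption of this sketch; as noted above, the real run time depends heavily on the amount of evidence and on the cluster, so treat the numbers as an order-of-magnitude guide at best.

```python
def cpu_hours(days, cpus):
    """Total CPU-hours for a run of the given wall-clock days on the given CPU count."""
    return days * 24 * cpus


def wall_days(genome_gb, cpu_hours_per_gb, cpus):
    """Estimated wall-clock days, ASSUMING cost scales linearly with genome size and CPUs."""
    return genome_gb * cpu_hours_per_gb / cpus / 24


if __name__ == "__main__":
    # ~1 week on ~80 CPUs for ~1 Gb, taken from the comment above.
    per_gb = cpu_hours(days=7, cpus=80)
    print(per_gb)                        # 13440 CPU-hours per Gb
    print(wall_days(2.5, per_gb, 80))    # 17.5 days for a 2.5 Gb genome on 80 CPUs
```

Under the same assumption, doubling the CPU count to 160 would halve the estimate.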


@Juke-34 - that's an excellent response. Thanks for saying what I didn't have time to write. I didn't realize that about MAKER-P.


I don’t have experience with BRAKER2, but it will probably run faster than MAKER-P. CAT requires a good, well-annotated reference species from which to transfer annotations to your species. Neither MAKER-P nor CAT is that easy to install, but there is a Bioconda recipe to install CAT.