Question: Best strategy for gene annotation for de novo genome assembly without RNA-seq data
soniabedi.07 wrote:

I would like to know the best method/pipeline for gene annotation after a draft genome assembly. Using BLAST is one way to go. Is there any tool (with good accuracy) that can annotate the genome? Thank you in advance.

(My draft assembly was done using MaSuRCA and SPAdes with paired-end reads, mate-pair reads and PacBio reads.)

assembly gene • 701 views
modified 7 months ago by Juke-34 • written 7 months ago by soniabedi.07

Would you mind adding the species you are working on, please? Thank you.

written 7 months ago by Bastien Hervé

It is a plant genome of roughly 2-2.5 Gbp.

written 7 months ago by soniabedi.07
jean.elbers wrote:

I am assuming this is a eukaryotic species. Depending on the species, you could give MAKER (http://www.yandell-lab.org/software/maker.html) protein evidence from related species together with a custom Augustus species model trained via BUSCO, and use it to predict protein-coding genes. If there is a related, well-annotated species (ideally from Ensembl), you could use the Comparative Annotation Toolkit (CAT, https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit). You could also run MAKER first, then align the predicted cDNA transcripts to your de novo genome and use those alignments as input for Augustus PB in the CAT workflow.

edit: changed predicted proteins to cDNA transcripts

modified 7 months ago • written 7 months ago by jean.elbers

Hi. Thank you for your input, it sounds really helpful. Could you link to any paper that I could read to understand the above in detail?

written 7 months ago by soniabedi.07

I don't know of any publications that describe this method specifically. Below is an unpublished methods section with the species concealed as "new rodent genome":


We first annotated new rodent genome scaffolds greater than 10 Kbp with MAKER v. 2.31.10 (Cantarel et al., 2008; Holt & Yandell, 2011). For the single MAKER run, we masked repetitive regions with RepeatMasker v. open-4.0.7 (http://www.repeatmasker.org) against the entire Dfam_Consensus release 20170127 database and used a new rodent genome specific repeat library created with RepeatModeler v. open-1.0.10 (http://www.repeatmasker.org) with the new rodent genome assembly as input. We filtered the repeat library from RepeatModeler to remove known UniProt/SwissProt v. 2019_01 (Boutet et al., 2016) proteins using ProtExcluder v. 1.1 (Campbell et al., 2014).
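The scaffold-length cutoff in this first step can be sketched in a few lines of Python. This is only an illustration of the "greater than 10 Kbp" filter, not the script used for the analysis above; the function names `read_fasta` and `filter_scaffolds` are hypothetical, and the input is assumed to be plain FASTA text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(lines, min_length=10_000):
    """Keep scaffolds strictly longer than min_length bases (assumed cutoff: 10 Kbp)."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) > min_length]
```

For a real assembly you would pass an open file handle, e.g. `filter_scaffolds(open("assembly.fasta"))`, where the file name is a placeholder.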

For the MAKER run, we included ab initio gene predictions from Augustus v. 3.3.2 (Stanke et al., 2006) trained with BUSCO v. 3.0.2 (Simão et al., 2015) using Eukaryota OrthoDB v. 9.1 genes (Zdobnov et al., 2017) and ab initio gene predictions from GeneMark-ES v. 4.38 (Lomsadze, 2005). We also included predicted proteins from Mus musculus and Rattus norvegicus (GenBank accessions [NCBI annotation release]: GCF_000001635.26 [106] and GCF_000001895.5 [106], respectively).

After the MAKER run finished, we only retained genes, transcripts, and proteins with annotation edit distance (AED) ≤ 0.50. We predicted putative gene function with blastp v. 2.2.31+ (Altschul, 1990) searches against the UniProt/SwissProt v. 2019_01 database using an E value cutoff of 1e-6 and assigned protein domains and gene ontology terms using InterProScan v. 5.32.71.0 (Jones et al., 2014).
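As a rough sketch of the AED filter described above, the Python below collects the IDs of transcripts with AED ≤ 0.50 from GFF3 lines. It assumes that the AED is stored as an `_AED` attribute on `mRNA` features in column 9 of the MAKER GFF3 output; check this against your own file before relying on it. The function names are hypothetical and this is not the code used for the analysis above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_passing_aed(gff_lines, max_aed=0.50):
    """Return IDs of mRNA features whose _AED attribute is <= max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: AED is recorded as "_AED" on mRNA features.
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs.get("ID", ""))
    return kept
```

The returned IDs can then be used to subset the gene, transcript and protein files so that all three stay consistent.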

After annotating the new rodent genome with MAKER, we transferred annotations from ENSEMBL 95’s annotation of GRCm38 (Genome Resource Consortium M. musculus genome assembly version 38) to the new rodent genome with the Comparative Annotation Toolkit (CAT, Fiddes et al., 2018). Briefly, we repeat masked the new rodent genome and GRCm38 with RepeatMasker v. open-4.0.8 (http://www.repeatmasker.org) against the mammal repeats from RepBase RepeatMaskerEdition-20181026 (Jurka et al., 2005). We then used default settings in Progressive Cactus (Paten et al., 2011; Paten et al., 2011) to generate a HAL (hierarchical alignment format) alignment between GRCm38 and the new rodent genome. For running CAT, we used Augustus with the setting “--augustus-utr-off” and the Augustus species being the same as that used during the MAKER run. When running CAT, we also used Augustus PB, whereby we generated synthetic long-read alignments between MAKER predicted cDNA transcript bases given a fake Phred quality score of Q40 with SeqTK v. 1.2-r102-dirty (https://github.com/lh3/seqtk) and the new rodent genome using Minimap2 v. 2.16 (Li, 2018) with the “-ax splice -uf -C5” settings.
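The "fake Phred quality score of Q40" step was done with SeqTK in the analysis above. Purely to show what that step produces, here is a minimal Python sketch that turns FASTA transcripts into FASTQ records with a constant quality; the function name `fasta_to_fastq` is hypothetical, and Phred+33 encoding is assumed, under which Q40 is the character "I".

```python
def fasta_to_fastq(fasta_lines, phred=40):
    """Convert FASTA lines to FASTQ text with one constant quality score."""
    quality_char = chr(33 + phred)  # Phred+33 encoding: Q40 -> "I"
    records = []                    # list of [name, sequence] pairs
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines
    out = []
    for name, bases in records:
        out += ["@" + name, bases, "+", quality_char * len(bases)]
    return "".join(row + "\n" for row in out)
```

The resulting FASTQ is what gets aligned back to the genome as if it were long-read data.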

modified 7 months ago • written 7 months ago by jean.elbers

Thank you. This is of great help.

written 7 months ago by soniabedi.07

@soniabedi.07 I might be able to offer assistance in this annotation if you are interested. I would need a link to the genome (Dropbox link would be great) and obviously what species it belongs to. You can email me these details if you are interested by clicking on the email link on my profile at https://www.vetmeduni.ac.at/de/fiwi/beruns/personen/

written 7 months ago by jean.elbers

Hmmm...I didn't realize this was a plant genome. I don't have direct experience with MAKER (really should be MAKER-P for plants) and CAT with plants.

written 7 months ago by jean.elbers
Juke-34 wrote:

If you are not experienced in genome annotation and you want a good result with minimal effort, I would recommend either the first approach from jean.elbers (MAKER with protein evidence and Augustus trained by BUSCO) or BRAKER2. For BRAKER2 you first need to map a set of proteins (e.g. SwissProt) to your genome and provide the resulting GFF to BRAKER2, which will do the rest. The Comparative Annotation Toolkit is excellent but not necessarily easy to run...

modified 7 months ago • written 7 months ago by Juke-34

Okay. Will try this. But I am rather curious to try Comparative Annotation Toolkit too.

written 7 months ago by soniabedi.07

Which will be the faster process (I have limited time): annotation with MAKER-P, BRAKER2 or CAT?

Also, what makes CAT not easy to run? I am just trying to understand more about it. Thank you for your valuable input.

written 7 months ago by soniabedi.07

No need to specify MAKER-P; this flavor no longer exists on its own, since all of its improvements were merged into MAKER several years ago. Compute time depends on many parameters. The longest step for CAT is the whole-genome alignment, so it will depend on how many closely related species (~ <50 My divergence) you can/want to use.
For MAKER it will depend on how many lines of evidence (size of the protein set, how many transcriptomes, ESTs, etc.) you want to align, and also on the size of your cluster. Using MPI you can scale MAKER to annotate a 20 Gb genome within 24 hours, but you will need a huge number of available compute nodes. For a ~1 Gb genome with a fair amount of evidence, it takes ~1 week on ~80 CPUs. BRAKER2 will be fast in comparison because it does not align evidence over the genome; it is ab initio. But I think it is only multithreaded, not parallelized over multiple compute nodes.
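To put the "~1 week on ~80 CPUs for ~1 Gb" figure in perspective for a 2-2.5 Gb genome, here is a back-of-the-envelope sketch in Python. It assumes that MAKER run time scales linearly with genome size at a fixed amount of evidence, which is a simplifying assumption made only for this illustration and not something measured; the function names are hypothetical.

```python
def cpu_hours(days, cpus):
    """Total CPU-hours for a run lasting `days` on `cpus` cores."""
    return days * 24 * cpus


def scaled_wall_days(ref_gb, ref_days, ref_cpus, target_gb, target_cpus):
    """Estimate wall-clock days for target_gb, ASSUMING linear scaling with genome size."""
    per_gb = cpu_hours(ref_days, ref_cpus) / ref_gb  # CPU-hours per Gb
    return per_gb * target_gb / target_cpus / 24


reference = cpu_hours(7, 80)                          # 13440 CPU-hours per ~1 Gb
estimate = scaled_wall_days(1.0, 7, 80, 2.5, 80)      # 17.5 days on the same 80 CPUs
```

Under that assumption a 2.5 Gb genome needs about 33,600 CPU-hours, so roughly 17.5 days on 80 CPUs or about half that on 160. Treat it as an order-of-magnitude guide only, since repeat content and the amount of evidence can change the real figure considerably.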

modified 7 months ago • written 7 months ago by Juke-34

@Juke-34 - that's an excellent response. Thanks for saying what I didn't have time to write. I didn't realize that about MAKER-P.

modified 7 months ago • written 7 months ago by jean.elbers

I don’t have experience with BRAKER2, but it will probably run faster than MAKER-P. CAT requires a closely related, well-annotated reference species from which to transfer annotations to your species. Neither MAKER-P nor CAT is that easy to install, but there is a Bioconda recipe for installing CAT.

written 7 months ago by jean.elbers