Question

Best Genome Assembler and Genome Annotation tools and pipelines

1

4.8 years ago

margab ▴ 10

I want to assemble and annotate the kiwi (bird) genome. What are the best genome assembly tool and genome annotation pipeline I can use?

We have 11 libraries with several insert sizes, prepared from Apteryx mantelli genomic DNA: 83 billion base pairs (Gb) of sequence from small insert-size libraries and 120 Gb from large-insert mate-pair Illumina libraries. The kiwi's genome size is about 1.6 Gb. The assembled contigs and scaffolds cover approximately 96% of the complete genome with an average sequence coverage of 35.85-fold after correction.

The tools I have found in my research are MaSuRCA, Platanus, ALLPATHS-LG and ABySS for genome assembly, and the BRAKER2, MAKER and CAT pipelines for genome annotation. For de novo gene prediction and annotation we can also provide 47.5 Gb of transcript sequence data from kiwi embryonic tissue, together with de novo gene predictions and protein evidence from three well-annotated bird species.

Thank you in advance.

My goal is to re-assemble and re-annotate the kiwi genome from the sequencing data provided by this article: DOI 10.1186/s13059-015-0711-4. I want to use new tools and pipelines in order to increase the efficiency of the assembly and the annotation.

annotation assembly genome birds • 3.9k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 4.8 years ago by margab ▴ 10

0

I'm a little confused about your question: it sounds like you have already done a lot (e.g. you seem to already have an assembly). So what is the goal of looking for other software? What have you used so far, by the way, and are you not satisfied with the current results?

ADD REPLY • link 4.8 years ago by lieven.sterck 15k

0

My goal is to re-assemble and re-annotate the kiwi genome from the sequencing data provided by this article: DOI 10.1186/s13059-015-0711-4. I want to use new tools and pipelines in order to increase the efficiency of the assembly and the annotation. In this paper they used SOAPdenovo2, MAKER and AUGUSTUS.

ADD REPLY • link 4.8 years ago by margab ▴ 10

score 4 · Answer 1 · 2019-07-17

There is no "best" assembler for all datasets. I would recommend SOAPdenovo2 to start with. ABySS is also good, though in my experience it produces shorter but highly accurate contigs. I believe ALLPATHS-LG requires a particular kind of input data, so I haven't used it.

Two comments:

The mate-pairs are critical to your analysis. Please check these data for duplicates and remove them; mate-pair libraries are known to have a very high duplicate content.
Long reads are far, far better than short reads for assembly. Why aren't you using them? An assembly would be greatly improved by adding a couple of MinION or PromethION/PacBio runs from a decent service provider, giving long and unfragmented contigs.
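
To make the duplicate-removal comment concrete, here is a minimal sketch in Python of dropping exact duplicate read pairs. It assumes plain four-line FASTQ records with mates in the same order in both files, and it treats a pair as a duplicate only when both mate sequences are identical to an earlier pair. The function names `read_fastq` and `dedup_pairs` are made up for this illustration and do not come from any published tool.

```python
import itertools


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple layout of exactly four lines per record.
    """
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(itertools.islice(stripped, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = record
        yield header, seq, qual


def dedup_pairs(r1_lines, r2_lines):
    """Yield read pairs, keeping only the first pair for each (seq1, seq2)."""
    seen = set()
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        key = (rec1[1], rec2[1])  # both mate sequences must match
        if key in seen:
            continue
        seen.add(key)
        yield rec1, rec2
```

Keeping every sequence pair in memory will not scale to a 120 Gb mate-pair data set as written; storing a hash digest of each pair instead, or using a dedicated deduplication tool, is the more realistic route. The sketch is only meant to show what "duplicate" means here.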

Also, a kiwi genome was already sequenced and assembled, in a highly fragmented fashion, several years ago. Maybe those data are useful.

score 2 · Answer 2 · 2019-07-21

For de novo genome assembly, you may wish to try the BIRCH system, which runs many of the popular assembly steps, including read quality checking, trimming, error correction, and de novo assembly using SOAPdenovo2, SPAdes or ABySS, and generates reports on assembly quality using QUAST. All of these steps can be done using our BioLegato graphical interface. See the tutorial at http://home.cc.umanitoba.ca/~psgendb/birchhomedir/BIRCHDEV/public_html/tutorials/bioLegato/genome_assembly/genome.html, and a video demonstrating how BioLegato makes it easy to do these tasks at https://www.youtube.com/watch?v=56T05sOcODI.

score 2 · Answer 3 · 2021-05-27

Hello,

You have an interesting problem here. I completely agree with what colindaven said about trying multiple assembly approaches. With short-read data it is highly likely that you will end up with a fragmented assembly, but you can annotate a fragmented assembly and still get a good number of genes and transcripts. Since you have RNA-Seq data, I would recommend using it for scaffolding. Here is a short step-by-step process that might help produce longer contigs:

Assemble the short-read DNA-Seq data using more than one assembler (MaSuRCA, Platanus, ALLPATHS-LG and ABySS).
Use some scaffolders to generate scaffolds from the assembled contigs (OPERA-LG, InGAP, iLSLS, AGOUTI, Rascaf [requires RNA-Seq reads], etc.).
Compute some statistics on the scaffolds to get an idea of which results are poorer or better than the others. You might be able to discard the results from a few of the assemblers and scaffolders, but I don't think you will be able to select the best assembly just by looking at the stats. There is a way to find the "best", though!
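
For the statistics step, a minimal sketch in Python of the usual contiguity numbers (sequence count, total bases, longest sequence, N50) computed from a scaffold FASTA. The function names are made up for this illustration, and it assumes an ordinary FASTA file with `>` header lines; a tool such as QUAST reports these and many more metrics.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that sequences of length >= L
    contain at least half of all bases (0 for an empty assembly)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def assembly_stats(lines):
    """Summarise one assembly given its FASTA lines."""
    lengths = list(read_fasta_lengths(lines))
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Calling `assembly_stats(open(path))` on each scaffold file gives a quick table for comparing assemblers, but as noted above, a higher N50 alone does not prove that an assembly is more correct.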

Once you have a few assemblies, try annotating them with genome annotation software. You have mentioned BRAKER and MAKER, but let me introduce you to FINDER. It is a state-of-the-art genome annotator that automates all the tasks of genome annotation. You can annotate each of the assemblies using the RNA-Seq data and then check which assembly gives you the best results. FINDER reports protein-coding genes too, so you will be able to run BUSCO on the genes to assess which assembly is better. FINDER runs BRAKER internally to enrich the set of genes with predictions. The FINDER paper is available, and the software is on GitHub.
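
To compare the annotated assemblies by their BUSCO results, something like the following Python sketch could rank them. It assumes the one-line score string that BUSCO prints in its short summary, in the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; that layout can differ between BUSCO versions, so check it against your own output. The function names and the ranking rule (most complete first, then fewest fragmented) are my own choices for illustration.

```python
import re

# Assumed layout of the BUSCO one-line score string; verify for your version.
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_line(text):
    """Extract the percentages and the BUSCO count from a score string."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score string found in: %r" % text)
    scores = {name: float(value) for name, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])
    return scores


def rank_assemblies(summaries):
    """Return assembly names ordered from best to worst.

    summaries maps an assembly name to its BUSCO score string.
    """
    scored = {name: parse_busco_line(line) for name, line in summaries.items()}
    return sorted(
        scored,
        key=lambda name: (-scored[name]["complete"], scored[name]["fragmented"]),
    )
```

For example, `rank_assemblies({"abyss": "...", "masurca": "..."})` with the score string of each run returns the names in order of completeness. A high duplicated percentage can also point to assembly problems, so it is worth looking at the full scores and not only the ranking.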

Thank you.