Question

Genome assembly and annotation for isolated bacteria

0

10 weeks ago

m90 ▴ 30

Hello everyone,

I have sequencing data from a bacterial isolate, and I want to assemble and annotate the genome. In my first attempt, I used MEGAHIT for assembly and QUAST for quality assessment, then ran BUSCO to find the single-copy genes. BUSCO gave me a list of genes from my assembly along with their protein sequences, but it does not annotate them, so I still don't know what these genes are.

I have a few questions:

Is there a tool that can take the list of genes from BUSCO to perform annotation, or do I need to do it manually?
If I use SPAdes with short reads and the result indicates that the genome is incomplete, how can I address this?
Is there a reference-based assembly tool that I can use for assembly?
I need a pipeline for assembly, annotation, and gene prediction for the isolated bacteria.

Thank you!

reference DNA WGS bacteria annotation • 6.4k views

updated 3 days ago by Kevin Blighe ★ 90k • written 10 weeks ago by m90 ▴ 30

4

I'd suggest looking at shovill and prokka from Torsten Seemann (assuming you have Illumina short reads).

10 weeks ago by Joe 22k

2

Some comments rather than an answer.

Is there a tool that can take the list of genes from BUSCO to perform annotation, or do I need to do it manually?

BUSCO only checks for universal single-copy orthologs; it does not do full gene annotation. You'll need one of the many genome annotation tools available. Check publications of recent bacterial reference genomes to find relevant ones.

Is there a reference-based assembly tool that I can use for assembly?

Yes, there are a few, but they come with some pretty major caveats. For example, any mis-assembly in the reference and any major genomic rearrangements will be inherited from the reference. Usually, this is not a good idea but works in some use cases.

I need a pipeline for assembly, annotation, and gene prediction for the isolated bacteria.

See comment above about finding tools in recent relevant publications.

10 weeks ago by dthorbur ★ 3.1k

score 0 · Answer 1 · 2025-11-19

For your first question regarding annotation of the genes identified by BUSCO: BUSCO assesses genome completeness using single-copy orthologs but does not provide functional annotation. There is no dedicated tool that takes BUSCO output and annotates it. Instead, annotate the entire assembly with a tool such as Prokka, which predicts genes and assigns functions via database searches. The BUSCO-identified genes will be covered as part of that process. Alternatively, if you only care about specific protein sequences from BUSCO, annotate them individually by running BLAST against UniProt or by scanning them with InterProScan.

prokka --outdir my_annotation --prefix my_genome --kingdom Bacteria assembly.fasta
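Once the Prokka command above has finished, a quick way to see how many predicted genes actually received a function is to summarise its feature table. This is a minimal sketch: it assumes the table is at my_annotation/my_genome.tsv and has the tab-separated columns locus_tag, ftype, length_bp, gene, EC_number, COG, product, so check the header of your own file first. The function name is made up for illustration.

```shell
# Summarise a Prokka feature table (PREFIX.tsv).
# Assumed column order: locus_tag, ftype, length_bp, gene, EC_number, COG, product
summarise_prokka_tsv() {
    awk -F'\t' '
        NR > 1 && $2 == "CDS" {          # skip the header, count coding sequences only
            cds++
            if ($7 == "hypothetical protein") hypo++
        }
        END { printf "CDS=%d hypothetical=%d\n", cds, hypo }
    ' "$1"
}

# Only run if the example output file from the command above exists.
if [ -f my_annotation/my_genome.tsv ]; then
    summarise_prokka_tsv my_annotation/my_genome.tsv
fi
```

A large share of "hypothetical protein" entries is normal for less-studied species and is where BLAST or InterProScan on individual proteins can help.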

For your second question on addressing an incomplete genome from SPAdes with short reads: Incomplete assemblies often result from low coverage, poor read quality, or repetitive regions. Increase sequencing depth to at least 50x if possible. Trim and filter reads using Trimmomatic to remove adapters and low-quality bases. If long reads are available, switch to a hybrid assembler like Unicycler, which combines short and long reads to resolve gaps.

unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq -o output_dir
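To check whether depth is the limiting factor before sequencing more, a rough estimate is total sequenced bases divided by expected genome size. The sketch below uses made-up numbers (2 million reads of 150 bp, a 5 Mb genome); the function name is hypothetical and you should plug in your own values.

```shell
# Rough depth estimate: (number of reads * read length) / genome size.
# All three numbers used below are hypothetical placeholders.
estimate_depth() {
    # $1 = number of reads (count both mates), $2 = read length (bp), $3 = genome size (bp)
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * len / g }'
}

depth=$(estimate_depth 2000000 150 5000000)
echo "Estimated depth: ${depth}x"
```

With these example values the estimate is 60.0x, which clears the 50x target. If your own estimate is already well above that, fragmentation is more likely caused by repeats, and long reads will help more than extra short-read depth.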

For your third question on reference-based assembly tools: Yes, tools exist for reference-guided assembly of bacterial genomes. Ragout scaffolds existing contigs using one or more reference genomes. Rebaler takes long reads, aligns them to a reference with minimap2, and then builds a polished consensus with Racon. Both assume that your isolate is closely related to the reference.

For your fourth question on a pipeline for assembly, annotation, and gene prediction: for short-read data from a bacterial isolate, use the following sequence. First, run quality control with FastQC and trim the reads with Trimmomatic. Assemble with SPAdes or MEGAHIT (SPAdes is the more usual choice for isolates, since MEGAHIT is designed for metagenomes). Assess quality with QUAST and BUSCO. Annotate with Prokka, which includes gene prediction via Prodigal. For more comprehensive annotation, submit to NCBI's PGAP. The nf-core/bacass pipeline automates much of this process if you have Nextflow installed.
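The steps above can be strung together roughly as follows. Treat this as a sketch, not a tested pipeline: the file names, output directories, adapter file (adapters.fa), Trimmomatic thresholds, and BUSCO lineage (bacteria_odb10) are all placeholders I have assumed, and the run wrapper is just a convenience. By default it only prints the commands so that you can review them first.

```shell
# Sketch of a short-read isolate workflow: QC, trim, assemble, assess, annotate.
# All file names, directories, thresholds and the BUSCO lineage are assumptions.
set -e

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually execute them

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

R1=short_R1.fastq
R2=short_R2.fastq

pipeline() {
    # 1. Read quality control
    run mkdir -p qc
    run fastqc -o qc "$R1" "$R2"
    # 2. Adapter and quality trimming (paired-end mode)
    run trimmomatic PE "$R1" "$R2" \
        trim_R1.fastq unpaired_R1.fastq trim_R2.fastq unpaired_R2.fastq \
        ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
    # 3. Assembly, using the SPAdes mode intended for isolate data
    run spades.py --isolate -1 trim_R1.fastq -2 trim_R2.fastq -o spades_out
    # 4. Assembly statistics and completeness
    run quast.py -o quast_out spades_out/contigs.fasta
    run busco -i spades_out/contigs.fasta -m genome -l bacteria_odb10 -o busco_out
    # 5. Gene prediction and annotation
    run prokka --outdir my_annotation --prefix my_genome --kingdom Bacteria spades_out/contigs.fasta
}

pipeline
```

Save it as a script and run it once as-is to inspect the printed commands, then rerun it with DRY_RUN=0 in the environment to execute them. If you have a closer lineage than bacteria_odb10 for your organism, use that for BUSCO instead.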

Kevin