Question

Assembling a chloroplast genome using only PacBio data (Illumina reads lost)

1 day ago

Serena • 0

Hello everyone,

I’m a master’s student working on a chloroplast genome assembly project. The sequencing data were generated about 4–5 years ago. Unfortunately, the company never provided us with the Illumina raw data — I only have PacBio reads available now.

I’d like to ask:

Is it feasible to assemble a complete chloroplast genome only using PacBio data?

Will the absence of Illumina reads significantly affect assembly quality or downstream analyses (such as gene annotation and comparative genomics)?

Would such a project still be considered substantial enough for a master’s thesis?

For context:

I’m relatively new to bioinformatics.

My lab mainly focuses on classical taxonomy, so I don’t have many local peers familiar with genome assembly.

The dataset is from a plant species (chloroplast genome expected ~150 kb).

Any advice on strategy, software suggestions, or similar experiences would be greatly appreciated.

Thank you very much in advance!

chloroplast genome-assembly • 173 views

ADD COMMENT • link updated 14 hours ago by shelkmike ★ 1.8k • written 1 day ago by Serena • 0

Can you find related chloroplast genomes (there must be some in NCBI) and try aligning the data you have to them?

It may be possible to do the assembly with what you have. shelkmike may have some suggestions.

ADD REPLY • link 1 day ago by GenoMax 154k

score 0 · Answer 1 · 2025-10-31

The answer depends on whether you have PacBio HiFi reads (high accuracy) or PacBio CLR reads (low accuracy). For HiFi reads, polishing with short reads is not required. With CLR reads, the assembly will still contain approximately one error per 10 kbp (Supplementary Table 5 in https://www.nature.com/articles/s41586-021-03451-0), and these errors will mostly be indels, so you'll get frameshifts in genes. Therefore, for CLR reads, polishing with short reads is necessary.
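
To put that error rate in perspective for a ~150 kb chloroplast genome, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it assumes errors are independent and equally likely at every position (a simplification, since indels tend to cluster in homopolymers), and the function name is made up for illustration.

```python
import math

def expected_assembly_errors(genome_length_bp, errors_per_bp):
    """Return (expected error count, probability of an error-free assembly).

    Assumes a Poisson model: errors are independent and equally likely
    at every position. This is a simplification, not a measured result.
    """
    expected = genome_length_bp * errors_per_bp
    p_error_free = math.exp(-expected)
    return expected, p_error_free

# ~150 kb chloroplast genome, ~1 error per 10 kbp in an unpolished CLR assembly
expected, p_clean = expected_assembly_errors(150_000, 1 / 10_000)
print(f"expected errors: {expected:.0f}")
print(f"chance of zero errors: {p_clean:.1e}")
```

Under these assumptions you would expect roughly 15 residual errors in the chloroplast genome, and with indels dominating, any of them that fall inside a coding sequence can break the reading frame.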

There are several tools specifically made for the assembly of chloroplast and mitochondrial genomes from long reads, namely:

Oatk (https://github.com/c-zhou/oatk)
TIPP (https://github.com/Wenfei-Xian/TIPP)
HiMT (https://github.com/tang-shuyuan/HiMT)
PMAT (https://github.com/bichangwei/PMAT)
Also, you can do an assembly with a general-purpose long-read assembler, Flye being a good choice (https://github.com/fenderglass/Flye), and then give the assembly graph to GetOrganelle (https://github.com/Kinggerm/GetOrganelle), which will find the chloroplast genome in the graph and make a FASTA file with the genome.
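
As a rough illustration of the Flye plus GetOrganelle route, the Python sketch below only builds the two command lines and prints them; it does not run anything. The file and directory names are placeholders, and the flags are the ones I understand from the tools' READMEs (`--pacbio-hifi` / `--pacbio-raw` for Flye, `-g`, `-F embplant_pt` and `-o` for `get_organelle_from_assembly.py`), so please check them against `--help` for the versions you install.

```python
import shlex

def build_flye_getorganelle_commands(reads, read_type="hifi", out_dir="flye_out",
                                     pt_dir="chloroplast_out", threads=8):
    """Build (but do not run) the command lines for Flye followed by GetOrganelle.

    All paths are placeholders; this helper is hypothetical, not part of either tool.
    """
    # Flye takes a different read-type flag for HiFi and for CLR (raw) reads
    flye_flag = {"hifi": "--pacbio-hifi", "clr": "--pacbio-raw"}[read_type]
    flye_cmd = ["flye", flye_flag, reads,
                "--out-dir", out_dir, "--threads", str(threads)]

    # Flye writes its assembly graph as assembly_graph.gfa in the output directory
    graph = f"{out_dir}/assembly_graph.gfa"

    # embplant_pt is GetOrganelle's label for land-plant plastid genomes
    getorg_cmd = ["get_organelle_from_assembly.py",
                  "-g", graph, "-F", "embplant_pt", "-o", pt_dir]
    return flye_cmd, getorg_cmd

flye_cmd, getorg_cmd = build_flye_getorganelle_commands("reads.fastq.gz", read_type="hifi")
print(shlex.join(flye_cmd))
print(shlex.join(getorg_cmd))
```

Once both tools are installed, each list can be passed to `subprocess.run(cmd, check=True)`, or you can simply copy the printed lines into a terminal.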

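Then, you should reorient the chloroplast genome so that it follows the conventional order LSC-IR-SSC-IR (large single-copy region, inverted repeat, small single-copy region, inverted repeat).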
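
Since the genome is circular, reorienting is just a rotation of the sequence (plus a reverse complement if the assembler reported the other strand). A minimal Python sketch follows; the function names and the toy sequences are invented for illustration, and the position where the LSC starts is something you have to determine yourself, for example by comparing with a related reference genome.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]

def rotate_circular(seq, new_start):
    """Rotate a circular sequence so that 0-based position new_start becomes the first base."""
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]

# Toy "chloroplast" with made-up regions: LSC, IR, SSC, and the second IR copy
lsc, ir, ssc = "AAAAAAAA", "ACGGT", "TTT"
canonical = lsc + ir + ssc + reverse_complement(ir)

# Pretend the assembler broke the circle at an arbitrary point (here, position 10)
assembled = rotate_circular(canonical, 10)

# In the assembled sequence the LSC now starts at len(canonical) - 10
lsc_start = len(canonical) - 10
reoriented = rotate_circular(assembled, lsc_start)
print(reoriented == canonical)
```

For a real genome you would read the FASTA sequence into a string, apply the same rotation, and write it back out.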
After that, you can annotate the chloroplast genome with GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html) or PGA (https://github.com/quxiaojian/PGA).
If I were doing this, the whole process would take less than 10 hours. Therefore, I don't think it is enough for a Master's thesis on its own. However, you can also assemble the mitochondrial genome, annotate it, and do some basic analyses of the chloroplast and mitochondrial genomes. Together, this will make a decent Master's thesis.