Question

Feasibility of detecting PCR-chimeric reads with Machine Learning (ML) for organelle genome assemblies

15 hours ago

moreDanOne • 0

Hello everyone! I'm a senior computer science student currently working on an undergraduate thesis, and I'd love to get some insights, especially on the biology side, since my biology background is very limited (for context, I have only done a bioinformatics internship).

The problem I'm trying to tackle: in some organelle genome assemblies (especially mitochondrial or chloroplast), PCR-chimeric reads can slip through and cause failed or messy assemblies (we use MITObim and GetOrganelle). A bioinformatician we talked to mentioned that, in most of their datasets, certain samples failed to assemble largely because of these chimeric reads.

I'm exploring a machine-learning-based detector for chimeric reads at the raw-read level, instead of relying only on downstream alignment filters. My current idea is to use a supervised classifier with shallow, interpretable sequence-based features, such as:

Split-alignment counts or discordant mapping patterns against a draft reference or organelle DB
k-mer frequency profiles (short-word distributions)
GC-content discontinuities within a read
Possibly local sequence complexity or entropy measures
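To make the sequence-only features above concrete, here is a minimal sketch in plain Python. The function names, the k-mer size, and the 25 bp window are illustrative assumptions, not tuned values or part of any existing tool. Split-alignment counts are left out because they depend on an aligner's output.

```python
import math
from collections import Counter


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq, k=3):
    """Relative frequency of each overlapping k-mer in the read."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(seq, k=1):
    """Shannon entropy (bits) of the k-mer distribution; low values mean low complexity."""
    return -sum(p * math.log2(p) for p in kmer_profile(seq, k).values())


def max_gc_jump(seq, window=25):
    """Largest absolute GC difference between adjacent windows left and right of any position.

    A large value hints at a composition discontinuity inside the read.
    The window size is an assumed default, not a recommendation.
    """
    best = 0.0
    for i in range(window, len(seq) - window + 1):
        left = gc_fraction(seq[i - window:i])
        right = gc_fraction(seq[i:i + window])
        best = max(best, abs(left - right))
    return best


def read_features(seq):
    """Bundle the per-read features into one dict for a downstream classifier."""
    return {
        "gc": gc_fraction(seq),
        "entropy_1mer": shannon_entropy(seq, 1),
        "entropy_3mer": shannon_entropy(seq, 3),
        "max_gc_jump": max_gc_jump(seq),
    }
```

For example, `read_features("AT" * 25 + "GC" * 25)` gives a `max_gc_jump` of 1.0, because the read switches from pure AT to pure GC at its midpoint. Real reads will show much weaker signals, so these features would probably need to be combined with alignment evidence.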

I'd love to hear from the community:

Does this approach sound technically feasible with typical Illumina-type short reads?
Are there existing datasets with validated chimeric vs. clean reads we could train on, or would we need to simulate chimeras in silico?
Any advice on the most informative features to start with, or pitfalls we should watch out for (like distinguishing true structural variants from artifacts)?

ML assembly chimeric mitochondria • 327 views

ADD COMMENT • link updated 4 hours ago by andres.firrincieli 3.9k • written 15 hours ago by moreDanOne • 0

andres.firrincieli · Answer 1 · 2025-09-29

6 hours ago

teamardigen • 0

Your idea is definitely feasible with Illumina reads. Because curated chimera datasets are rare, most people simulate chimeras in silico (by concatenating random read fragments), though you can also mine failed assemblies for examples. Split alignments and k-mer profiles will likely be your strongest features; GC and entropy shifts can help but may be confounded by true biological variation. The main pitfall is overcalling: be careful to separate artifacts from genuine rearrangements. Starting with simulations plus a simple, interpretable classifier is a solid proof-of-concept path.
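As a rough illustration of the in silico simulation mentioned above, the sketch below joins the start of one read to the end of another at a random breakpoint and labels the result. The function names, the `min_part` parameter, and the uniform breakpoint choice are all assumptions made for illustration; real PCR chimeras tend to form in a more structured way, so treat this only as a starting point.

```python
import random


def make_chimera(read_a, read_b, rng, min_part=20):
    """Join a prefix of read_a to a suffix of read_b at a random breakpoint.

    Returns (chimeric_sequence, breakpoint). The output is as long as the
    shorter parent read, and each parent contributes at least min_part bases.
    """
    length = min(len(read_a), len(read_b))
    if length < 2 * min_part:
        raise ValueError("reads are too short for the requested min_part")
    breakpoint_pos = rng.randint(min_part, length - min_part)
    return read_a[:breakpoint_pos] + read_b[breakpoint_pos:length], breakpoint_pos


def simulate_dataset(reads, n_chimeras, seed=0, min_part=20):
    """Build a labeled list of (sequence, label): 0 = clean read, 1 = simulated chimera."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    dataset = [(read, 0) for read in reads]
    for _ in range(n_chimeras):
        read_a, read_b = rng.sample(reads, 2)  # two distinct parent reads
        chimera, _ = make_chimera(read_a, read_b, rng, min_part)
        dataset.append((chimera, 1))
    return dataset
```

Calling `simulate_dataset(clean_reads, n_chimeras=1000)` on a list of trusted reads would give a balanced-ish training set once `n_chimeras` is matched to the number of clean reads. Drawing the two parents from different genomic regions (or from nuclear vs. organelle reads) is a design choice worth testing, since chimeras between near-identical fragments are effectively undetectable.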

Source: ChatGPT

ADD COMMENT • link updated 4 hours ago by andres.firrincieli 3.9k • written 6 hours ago by teamardigen • 0

While there is no specific rule against the use of ChatGPT or similar tools, their use must be clearly stated in the answer.

ADD REPLY • link 4 hours ago by andres.firrincieli 3.9k