Feasibility of detecting PCR-chimeric reads with Machine Learning (ML) for organelle genome assemblies
15 hours ago
moreDanOne • 0

Hello everyone! I'm a senior computer science student currently working on an undergraduate thesis, and I'd love to get some insights, especially on the biology side, as I have very limited knowledge of biology (for context, my only exposure is a bioinformatics internship).

The problem I'm trying to tackle: in some organelle genome assemblies (especially mitochondrial or chloroplast), PCR-chimeric reads can slip through and cause failed or messy assemblies (in our case with MITObim and GetOrganelle). A bioinformatician we talked to mentioned that, in most of their datasets, certain samples failed to assemble largely because of these chimeric reads.

I'm exploring a machine-learning-based detector for chimeric reads at the raw-read level, instead of relying only on downstream alignment filters. My current idea is to use a supervised classifier with shallow, interpretable sequence-based features, such as:

  • Split-alignment counts or discordant mapping patterns against a draft reference or organelle DB
  • k-mer frequency profiles (short-word distributions)
  • GC-content discontinuities within a read
  • Possibly local sequence complexity or entropy measures
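
To make the sequence-only features above concrete, here is a minimal sketch in plain Python. The function names, the 25 bp window, and the k values are illustrative assumptions, not settings taken from any published tool, and the alignment-based feature is deliberately left out because it needs a mapper.

```python
import math
from collections import Counter
from itertools import product


def gc_fraction(seq):
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def max_gc_jump(read, window=25):
    """Largest |GC(left window) - GC(right window)| over all split points.

    The window size is an assumed default; returns 0.0 for reads shorter
    than two windows.
    """
    best = 0.0
    for i in range(window, len(read) - window + 1):
        left = gc_fraction(read[i - window:i])
        right = gc_fraction(read[i:i + window])
        best = max(best, abs(left - right))
    return best


def kmer_profile(read, k=3):
    """Normalized k-mer frequencies in a fixed order over the ACGT alphabet.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers) or 1
    return [counts[km] / total for km in kmers]


def shannon_entropy(read, k=2):
    """Shannon entropy (in bits) of the read's k-mer distribution."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def read_features(read):
    """Feature vector: [GC jump, entropy] followed by 64 trinucleotide frequencies."""
    read = read.upper()
    return [max_gc_jump(read), shannon_entropy(read)] + kmer_profile(read)
```

Calling `read_features(seq)` on each read yields a fixed-length vector that any standard classifier can consume. Whether these features actually separate chimeras from clean organelle reads is an open question that only labeled data can answer.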

I'd love to hear from the community:

  1. Does this approach sound technically feasible with typical Illumina-type short reads?
  2. Are there existing datasets with validated chimeric vs. clean reads we could train on, or would we need to simulate chimeras in silico?
  3. Any advice on the most informative features to start with, or pitfalls we should watch out for (like distinguishing true structural variants from artifacts)?
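
For question 2, in silico simulation could look roughly like the sketch below. It is a simplified assumption of how labeled data might be generated: error-free reads are sampled from a reference string, and a chimera is a prefix of one read joined to a suffix of another. All names and defaults (`make_chimera`, `simulate_dataset`, `min_part`, `chimera_rate`) are hypothetical, and the model ignores sequencing errors, reverse-complement joins, and any bias in where real PCR chimeras form.

```python
import random


def make_chimera(read_a, read_b, rng, min_part=20):
    """Join a prefix of read_a to a suffix of read_b at a random cut point.

    Returns (chimeric_read, cut). Both reads must have the same length,
    at least 2 * min_part, so each parent contributes >= min_part bases.
    """
    length = len(read_a)
    if len(read_b) != length or length < 2 * min_part:
        raise ValueError("reads must have equal length >= 2 * min_part")
    cut = rng.randint(min_part, length - min_part)
    return read_a[:cut] + read_b[cut:], cut


def sample_read(genome, read_len, rng):
    """Sample one error-free read from a uniformly random start position."""
    start = rng.randrange(len(genome) - read_len + 1)
    return genome[start:start + read_len]


def simulate_dataset(genome, n_reads, read_len=100, chimera_rate=0.1, seed=0):
    """Return a list of (read, label) pairs with label 1 for chimeras.

    Note: two parents sampled from overlapping positions can yield a
    "chimera" nearly identical to a genuine read; a real pipeline would
    probably want to enforce a minimum distance between parents.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_reads):
        read = sample_read(genome, read_len, rng)
        if rng.random() < chimera_rate:
            other = sample_read(genome, read_len, rng)
            read, _ = make_chimera(read, other, rng)
            dataset.append((read, 1))
        else:
            dataset.append((read, 0))
    return dataset
```

Keeping the cut position returned by `make_chimera` would also allow training a breakpoint locator later, not just a binary classifier.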
ML assembly chimeric mitochondria • 327 views
6 hours ago

Your idea looks feasible on Illumina reads. Since curated chimera datasets are rare, most people simulate them in silico (concatenating random read fragments), though you can also mine failed assemblies for examples. Split alignments and k-mer profiles will likely be your strongest features, while GC/entropy shifts can help but may be confounded by true biological variation. The main pitfall is overcalling: be careful to separate artifacts from genuine rearrangements. Starting with simulations plus a simple, interpretable classifier is a solid proof-of-concept path.
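
As a rough illustration of the split-alignment feature, the sketch below reads one SAM record and counts evidence that the read was aligned in pieces. It relies only on fields defined in the SAM specification (the 0x800 supplementary flag, the `SA:Z` tag that aligners such as BWA-MEM emit for chimeric alignments, and S/H clipping in the CIGAR string). The function name and the returned keys are hypothetical.

```python
import re

SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_alignment_features(sam_line):
    """Extract simple split-alignment evidence from one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]

    # SA:Z lists the other segments of a chimeric alignment, ';'-separated.
    n_sa_segments = 0
    for tag in fields[11:]:
        if tag.startswith("SA:Z:"):
            n_sa_segments = len([s for s in tag[5:].split(";") if s])

    # Soft- (S) and hard- (H) clipped bases are the unaligned part of the read.
    clipped = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "SH")

    return {
        "is_supplementary": bool(flag & SUPPLEMENTARY),
        "n_sa_segments": n_sa_segments,
        "clipped_bases": clipped,
    }
```

A read with `n_sa_segments > 0` is only a chimera candidate: genuine rearrangements and reads spanning the arbitrary linearization point of a circular organelle genome can produce the same signature.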

Source: ChatGPT

While there is no specific rule against the use of ChatGPT or similar tools, their use must be clearly stated in the answer.
