Question

How do I generate synthetic 16S RNA gene sequences?

1

3 months ago

Max Sikolenko ▴ 80

Hi!

It seems I need some advice.

I intend to generate synthetic sequences of 16S rRNA genes to augment a sequence data set for machine learning.

First of all, I don’t think it is a good idea to solve the problem using mere post-vectorization oversampling (such as SMOTE). The reason lies in the nature of DNA/RNA sequences: they have intrinsic constraints (such as conserved regions) that can hardly be captured by linear approaches like SMOTE. Am I right in this assumption?

So, I am looking for a method to generate synthetic 16S rRNA gene sequences that respects these intrinsic constraints. I believe such constraints can be defined by covariance models (CMs), HMMs or, maybe, mutation profiles.

Of course, Infernal’s cmemit program seems like such a method, and a ready-to-use tool, too. However, its results are quite unexpected. My test was simple:

I sampled 100 Pseudomonas 16S rRNA sequences from the RiboGrove database;
then built a CM from them;
and finally generated synthetic sequences using cmemit.

However, the generated synthetic sequences are very dissimilar to the original real-world sequences. You can see the MSA in the image below. The synthetic sequences are below the orange line, and the real-world sequences are above it. The synthetic sequences by no means resemble the original real-world 16S rRNA sequences, so I didn’t even bother measuring identities. My commands for the simple test are below the image, just in case.

The cmemit program has no options that would control the level of sequence dissimilarity (if such control is possible at all).

Well, now I am out of ideas, unfortunately. If someone could help me with some advice, I would be super grateful.

[Image: alignment of real-world and synthetic sequences]

# Commands for my simple test

## Select 100 real-world Pseudomonas sequences from RiboGrove
seqkit grep -nrp ';g__Pseudomonas;' \
    ribogrove_25.231_sequences.fasta.gz \
    | seqkit sample -n 100 \
    | seqkit seq --dna2rna \
    > some_Pseudomonas_16S_seqs.fasta

## Create an MSA with structural annotation
cmalign \
    --cpu 6 \
    /mnt/data/Max/tmp/emit_seqs_tyk/RF00177.cm \
    some_Pseudomonas_16S_seqs.fasta \
    > some_Pseudomonas_16S_seqs.sto

## Create a CM
cmbuild \
    some_Pseudomonas_16S_seqs_from_CM.cm \
    some_Pseudomonas_16S_seqs.sto

## Generate 15 synthetic sequences from the built CM
cmemit \
    -N 15 \
    some_Pseudomonas_16S_seqs_from_CM.cm \
    > some_Pseudomonas_16S_seqs_from_CM_emitted.fasta

## Combine real-world and synthetic sequences in a single fasta file
{
    cat some_Pseudomonas_16S_seqs.fasta;
    cat some_Pseudomonas_16S_seqs_from_CM_emitted.fasta;
} | seqkit seq --rna2dna \
> some_Psm_16S_seqs_from_CM_emitted_together.fasta

## Perform MSA
mafft \
    --thread 6 \
    some_Psm_16S_seqs_from_CM_emitted_together.fasta \
    > some_Psm_16S_seqs_from_CM_emitted_together.afa
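To put a number on the dissimilarity instead of judging the MSA by eye, percent identity could be computed from the aligned FASTA. The following is only a minimal sketch: the function name `identity_to_first` is made up for illustration, the first record of the alignment is assumed to serve as the reference, and columns where either sequence has a gap are skipped, which is just one of several possible identity definitions.

```shell
# Percent identity of every aligned sequence to the first record of an
# aligned FASTA file (hypothetical helper, not part of MAFFT or seqkit).
# Columns where either sequence has a gap are excluded from the comparison.
identity_to_first() {  # usage: identity_to_first alignment.afa
    LC_ALL=C awk '
        /^>/  { n++; names[n] = substr($1, 2); next }   # new record: store its ID
        n > 0 { seqs[n] = seqs[n] toupper($0) }          # join wrapped sequence lines
        END {
            len = length(seqs[1])
            for (i = 2; i <= n; i++) {
                same = 0; cmp = 0
                for (c = 1; c <= len; c++) {
                    a = substr(seqs[1], c, 1)
                    b = substr(seqs[i], c, 1)
                    if (a == "-" || b == "-") continue
                    cmp++
                    if (a == b) same++
                }
                printf "%s\t%.1f\n", names[i], (cmp ? 100 * same / cmp : 0)
            }
        }' "$1"
}
```

Calling `identity_to_first some_Psm_16S_seqs_from_CM_emitted_together.afa` would print one "name, identity" line per sequence; the last 15 lines would correspond to the emitted sequences, since they were appended last.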



2

3 months ago

Mensur Dlakic ★ 30k

Why do you need to augment 16S rRNA sequences, given that tens of millions are already available? If your main reason is sequence diversity, I will remind you that there are biological reasons why 16S rRNA sequences are generally not very diverse.

> First of all, I don’t think it is a good idea to solve the problem using mere post-vectorization oversampling (such as SMOTE).

I don't think SMOTE works well even in ordinary ML applications, so it is a safe assumption that it won't work in this case.

Covariance models preserve both sequence and structural features. They are famous for being able to detect RNA sequences that have low sequence identity to the consensus model. Therefore, I am not surprised you are getting sequences that are significantly different from the starting group in terms of sequence identity.

If you still wish to pursue this idea - I advise against it - my suggestion is to try RNA generative models:

https://www.nature.com/articles/s41467-024-54812-y


0

> Why do you need to augment 16S rRNA sequences, given that tens of millions are already available? If your main reason is sequence diversity, I will remind you that there are biological reasons why 16S rRNA sequences are generally not very diverse.

I deal with class imbalance in 16S rRNA gene copy number (GCN) prediction. Prokaryotes with high 16S rRNA GCN are rare, hence the imbalance.

> They [CMs] are famous for being able to detect RNA sequences that have low sequence identity to the consensus model. Therefore, I am not surprised you are getting sequences that are significantly different from the starting group in terms of sequence identity.

Quite reasonable.

> If you still wish to pursue this idea - I advise against it - my suggestion is to try RNA generative models

Thank you for the suggestion!

Reply by Max Sikolenko ▴ 80, 3 months ago

score 4 · Accepted Answer · 2025-07-28

3 months ago

shelkmike ★ 1.8k

Maybe the following is worth doing:
1) First, you need to understand how much the natural sequences differ from the covariance model. To do this, you should compare each natural sequence with the covariance model using Infernal, and look at the bit score. Then, calculate the mean bit score and the standard deviation of bit scores.
2) Generate sequences with cmemit, reducing the value of "--exp", so that the generated sequences are more similar to natural sequences.
3) Retain only those generated sequences whose bit scores differ from the mean of the natural sequences by no more than, for example, two standard deviations.
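The scoring and filtering in steps 1 and 3 could be sketched roughly as follows. This is not a tested pipeline: the function names are made up, the file names are placeholders, and the column positions assume that the `cmalign --sfile` score table has the sequence name in column 2 and the bit score in column 7, which should be checked against your Infernal version. cmalign is used here because, unlike cmsearch, it does not need a calibrated CM.

```shell
# Sketch only; helper names, file names and the 2-SD cutoff are assumptions.

# Steps 1 and 3: write "name<TAB>bit score" for each sequence aligned to the CM.
# Assumes the --sfile table has the name in column 2 and the bit score in column 7.
score_seqs() {  # usage: score_seqs model.cm seqs.fasta > scores.tsv
    local sfile
    sfile=$(mktemp)
    cmalign --sfile "$sfile" -o /dev/null "$1" "$2" > /dev/null
    awk '!/^#/ && NF >= 7 { print $2 "\t" $7 }' "$sfile"
    rm -f "$sfile"
}

# Step 1: mean and sample standard deviation of the bit scores (column 2).
score_stats() {  # usage: score_stats scores.tsv  -> prints "mean sd"
    LC_ALL=C awk -F'\t' '
        { n++; s += $2; ss += $2 * $2 }
        END {
            if (n < 2) exit 1
            m = s / n
            v = (ss - n * m * m) / (n - 1); if (v < 0) v = 0
            printf "%.3f %.3f\n", m, sqrt(v)
        }' "$1"
}

# Step 3: print names whose bit score lies within k SDs of the given mean.
names_within() {  # usage: names_within scores.tsv mean sd [k]
    LC_ALL=C awk -F'\t' -v m="$2" -v sd="$3" -v k="${4:-2}" '
        { d = $2 - m; if (d < 0) d = -d; if (d <= k * sd) print $1 }' "$1"
}

# Possible usage (file names are placeholders):
#   score_seqs model.cm natural.fasta > natural.tsv
#   read -r mean sd < <(score_stats natural.tsv)
#   score_seqs model.cm emitted.fasta > emitted.tsv
#   names_within emitted.tsv "$mean" "$sd" 2 > keep.txt
#   seqkit grep -f keep.txt emitted.fasta > emitted_filtered.fasta
```

The two-standard-deviation window is simply the example cutoff from step 3; a tighter or looser value of `k` may suit a particular data set better.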


1

Pretty reasonable. Thank you! One correction: the value of --exp should be increased, not reduced, to make the generated sequences more similar to the natural ones. According to the Infernal User's Guide:

> --exp <x> [...] With <x> less than 1.0 the emitted sequences will tend to have lower bit scores upon alignment to the CM. With <x> greater than 1.0, the emitted sequences will tend to have higher bit scores upon alignment to the CM.
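Since a suitable --exp value is hard to guess in advance, one option is to emit a batch of sequences at several values above 1.0 and compare their bit scores with those of the natural sequences. A minimal sketch, in which the function name `emit_sweep`, the output file naming and the example values are all arbitrary assumptions:

```shell
# Sketch only: emit sequences from a CM at several --exp values.
# The helper name and the output file names are made up for illustration.
emit_sweep() {  # usage: emit_sweep model.cm n_seqs exp1 [exp2 ...]
    local cm=$1 n=$2 x out
    shift 2
    for x in "$@"; do
        out="emitted_exp_${x}.fasta"
        # cmemit writes the emitted sequences to stdout, as in the original test
        cmemit -N "$n" --exp "$x" "$cm" > "$out" || return 1
        echo "$out"   # report which file was written
    done
}

# Possible usage (values are placeholders, not recommendations):
#   emit_sweep some_Pseudomonas_16S_seqs_from_CM.cm 15 1.5 2.0 3.0
```

Each resulting file could then be scored against the CM to see which --exp value brings the emitted sequences into the bit-score range of the real ones.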