Hi!
It seems I am in need of some advice.
I intend to generate synthetic sequences of 16S rRNA genes to augment a sequence data set for machine learning.
First of all, I don’t think it is a good idea to solve the problem using mere post-vectorization oversampling (such as SMOTE). The reason lies in the nature of DNA/RNA sequences: they have intrinsic constraints (such as conserved regions) that can hardly be captured by linear interpolation approaches like SMOTE. Am I right in this assumption?
So, I am looking for a method to generate synthetic 16S rRNA gene sequences that would respect these intrinsic constraints. I believe such constraints can be defined by covariance models (CMs), HMMs or, maybe, mutation profiles.
Of course, Infernal’s cmemit program seems like such a method, and a ready-to-use tool, too. However, its results are quite unexpected. My test was simple:
- I sampled 100 Pseudomonas 16S rRNA sequences from the RiboGrove database;
- then built a CM from them;
- and finally generated synthetic sequences using cmemit.
However, the generated synthetic sequences are very dissimilar to the original real-world sequences. You can see the MSA in the image below. The synthetic sequences are below the orange line, and the real-world sequences are above it. The synthetic sequences by no means resemble the original real-world 16S rRNA sequences, so I did not even bother measuring identities. My commands for the simple test are below the image, just in case.
The cmemit program has no options that would control the level of sequence dissimilarity (if such control is possible at all).
Unfortunately, I have now run out of ideas. If someone could help me with some advice, I would be super grateful.
# Commands for my simple test
## Select 100 real-world Pseudomonas sequences from RiboGrove
seqkit grep -nrp ';g__Pseudomonas;' \
ribogrove_25.231_sequences.fasta.gz \
| seqkit sample -n 100 \
| seqkit seq --dna2rna \
> some_Pseudomonas_16S_seqs.fasta
## Create an MSA with structural annotation
cmalign \
--cpu 6 \
/mnt/data/Max/tmp/emit_seqs_tyk/RF00177.cm \
some_Pseudomonas_16S_seqs.fasta \
> some_Pseudomonas_16S_seqs.sto
## Create a CM
cmbuild \
some_Pseudomonas_16S_seqs_from_CM.cm \
some_Pseudomonas_16S_seqs.sto
## Generate 15 synthetic sequences from the built CM
cmemit \
-N 15 \
some_Pseudomonas_16S_seqs_from_CM.cm \
> some_Pseudomonas_16S_seqs_from_CM_emitted.fasta
## Combine real-world and synthetic sequences in a single fasta file
{
cat some_Pseudomonas_16S_seqs.fasta;
cat some_Pseudomonas_16S_seqs_from_CM_emitted.fasta;
} | seqkit seq --rna2dna \
> some_Psm_16S_seqs_from_CM_emitted_together.fasta
## Perform MSA
mafft \
--thread 6 \
some_Psm_16S_seqs_from_CM_emitted_together.fasta \
> some_Psm_16S_seqs_from_CM_emitted_together.afa
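To put a number on the dissimilarity instead of judging the MSA by eye, something like the sketch below could be run on the MAFFT output. It is only an illustration: the function name identity_to_first is made up here, the first sequence in the alignment is arbitrarily used as the reference, and columns where either sequence has a gap are skipped (other identity definitions would give different values).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of seqkit, MAFFT, or Infernal):
# percent identity of every aligned sequence against the FIRST sequence
# in an aligned FASTA file. Gapped columns are ignored (an assumption).
identity_to_first() {
    awk '
        /^>/ { n++; name[n] = substr($1, 2); next }   # new record: keep its ID
        { seq[n] = seq[n] toupper($0) }               # join wrapped lines, ignore case
        END {
            len = length(seq[1])
            for (i = 2; i <= n; i++) {
                same = 0; compared = 0
                for (c = 1; c <= len; c++) {
                    a = substr(seq[1], c, 1)
                    b = substr(seq[i], c, 1)
                    if (a == "-" || b == "-") continue   # skip gapped columns
                    compared++
                    if (a == b) same++
                }
                pct = compared ? 100 * same / compared : 0
                printf "%s\t%.1f\n", name[i], pct
            }
        }
    ' "$1"
}
```

With the files from the test above, it would be called as identity_to_first some_Psm_16S_seqs_from_CM_emitted_together.afa, printing one tab-separated line (sequence ID, percent identity) per sequence after the first.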
Pretty reasonable. Thank you! The only correction to add is that one should increase the value of --exp to increase the similarity of the generated sequences. According to the Infernal User's Guide:
Thank you, you're right.
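For anyone who wants to try the --exp suggestion from the reply, a minimal sketch might look like this. The helper name emit_with_exp, the output file naming, and the example values are assumptions made for illustration; suitable --exp values would have to be found empirically.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run cmemit once per --exp value so that the emitted
# sets can be compared against the real sequences.
# Usage: emit_with_exp <model.cm> <number of sequences> <exp value>...
emit_with_exp() {
    local cm="$1" n="$2" x
    shift 2
    for x in "$@"; do
        # One output file per --exp value (the naming scheme is arbitrary)
        cmemit -N "$n" --exp "$x" "$cm" > "emitted_exp_${x}.fasta"
    done
}
```

For example, emit_with_exp some_Pseudomonas_16S_seqs_from_CM.cm 15 1 2 5 would write three files, with 1 serving as the baseline that should behave like the default.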