The previous title was "sequence similarity analysis between groups of closely related DNA sequences". It was edited for clarity.
I'll start by briefly explaining my problem.
The data: We have a list of a few hundred short sequences (pre-miRNA sequences, ~100 nt long). We expanded the list by modifying those sequences, sometimes changing the whole sequence and sometimes only a few nucleotides. We now have a few thousand short sequences, some of which are very similar to each other. We ran experiments with those sequences and separated them into groups.
The goal: We want to find out how the groups of sequences differ from one another.
I thought motif analysis tools might help, but from a quick search it looks like most of them are geared towards genome analysis rather than individual sequences.
Any ideas or directions for me to search?
EDIT: additional information
I'll add more details to explain the problem more clearly. Here is an example from our FASTA file:
>Reversed_hsa-mir-6791
GACGGCCTCTGGTTCCTCCGTCTCCGTCAAGTCTGGAGAAAGGCGGACGGGTCGGGGTCCCCAGACC
>AllPreMir_hsa-mir-6791
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_3p_seed_changed
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCGAGCTTGGTCTCCGGCAG
>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_3p_seed_changed
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTCGGCTAAGTGCTCCTTCTGGCA
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG
So we have the original sequence, tagged "AllPreMir", and four additional sequences designed from it: one in reversed order ("Reversed"), one in random order ("Scrambled"), and two that differ from the original by only 3 nucleotides ("SeedChanged", one for the 5p seed and one for the 3p seed).
After designing the sequences we ran a few experiments and sorted the sequences into groups based on the experimental results. For example, group one would be:
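To make the relationship between the variants concrete, here is a minimal sketch that checks two of the hsa-mir-6791 records from the example above against the original. The helper name `hamming_positions` and the variable names are my own, purely for illustration; the sequences are copied from the FASTA example.

```python
def hamming_positions(a, b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Records copied from the FASTA example (hsa-mir-6791).
original = "CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG"
reversed_seq = "GACGGCCTCTGGTTCCTCCGTCTCCGTCAAGTCTGGAGAAAGGCGGACGGGTCGGGGTCCCCAGACC"
seed_5p = "CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG"

# "Reversed" is the plain reversal of the original (not the reverse complement).
is_reversed = reversed_seq == original[::-1]

# "SeedChanged" differs from the original at only a few positions.
changed = hamming_positions(original, seed_5p)

print("reversed matches:", is_reversed)
print("5p seed positions changed:", changed)
```

The point is that the SeedChanged variants are nearly identical to their originals, so any method that treats sequences as independent samples will see a lot of redundancy.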
>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
And the second group would have:
>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG
My goal now is to find meaningful differences between the two groups. The actual files would have hundreds of sequences, some of them very similar to each other. Such differences could be different k-mers, different "consensus" sequences, different motifs, or something else. Ideally there would be a tool (or several tools) that runs these analyses on a FASTA file and writes the metrics to a file, so that I could compare them between the different groups of sequences.
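To illustrate the kind of k-mer comparison I have in mind, here is a minimal sketch in plain Python. The function names (`kmer_presence`, `differential_kmers`) and the toy sequences are hypothetical and only for illustration. It counts, for each k-mer, the fraction of sequences in each group that contain it at least once, then ranks k-mers by the difference between the two groups. It does no statistical testing and does not correct for near-duplicate sequences, both of which would matter on the real data.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def differential_kmers(group_a, group_b, k=4):
    """Rank k-mers by the difference in the fraction of sequences containing them."""
    a = kmer_presence(group_a, k)
    b = kmer_presence(group_b, k)
    rows = []
    for kmer in set(a) | set(b):
        frac_a = a[kmer] / len(group_a)
        frac_b = b[kmer] / len(group_b)
        rows.append((kmer, frac_a, frac_b, frac_a - frac_b))
    # Largest absolute difference first; ties broken alphabetically.
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows


# Toy sequences for illustration only, not taken from the real data.
group_one = ["AAAA", "AAAT"]
group_two = ["CCCC"]

for kmer, frac_a, frac_b, diff in differential_kmers(group_one, group_two, k=3):
    print(f"{kmer}\t{frac_a:.2f}\t{frac_b:.2f}\t{diff:+.2f}")
```

Counting presence per sequence rather than total occurrences keeps a single repetitive sequence from dominating the ranking. The tab-separated output is the kind of per-group metric file I would like to be able to compare.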