Question

Edited Title: Finding differential sequence characteristics between two groups of sequences

4.2 years ago

artemd ▴ 10

The previous title was "sequence similarity analysis between groups of closely related DNA sequences". It was edited for clarity.

Hello,

I'll start by briefly explaining my problem. The data: we have a list of a few hundred short sequences (pre-miRNA sequences, ~100 nt long). We then expanded the list by modifying those sequences (sometimes changing the whole sequence completely, sometimes changing only a few nucleotides). So now we have a few thousand short sequences, some of which are very similar to each other. We did some experiments with those sequences and separated them into groups.

The goal: we want to find how different groups of sequences might differ from other groups.

I thought tools for motif analysis might help, but from a quick search it looks like most of them are geared towards genome analysis rather than individual sequences.

Any ideas or directions for me to search?

Thanks,

Artem.

EDIT: additional information

I'll try to add more details to explain the problem more clearly. Here is an example from our fasta:

>Reversed_hsa-mir-6791
GACGGCCTCTGGTTCCTCCGTCTCCGTCAAGTCTGGAGAAAGGCGGACGGGTCGGGGTCCCCAGACC
>AllPreMir_hsa-mir-6791
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_3p_seed_changed
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCGAGCTTGGTCTCCGGCAG
>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_3p_seed_changed
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTCGGCTAAGTGCTCCTTCTGGCA
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG

So we have the original sequence, tagged "AllPreMir", and we designed a few modifications to create 4 additional sequences from each original: one in reversed order, one in random order, and two that differ from the original by only 3 nucleotides.

After designing the sequences we ran a few experiments and obtained different groups of sequences (based on experimental results). For example, group one would be:

>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA

And the second group would have:

>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG

Now my goal is to find meaningful differences between the two groups (the actual files would have hundreds of sequences, some of which could be very similar). Such differences could be different k-mers, different "consensus" sequences, different motifs, or something else. In a perfect world there would be a tool (or several tools) that performs those analyses on a FASTA file and writes the metrics to a file that I could compare between the different groups of sequences.

genome sequence gene • 1.5k views

ADD COMMENT • link 4.2 years ago by artemd ▴ 10

Have you tried metrics that quantify the degree of variance within and between the groups of sequences? For example: pairwise nucleotide differences within a group and between different groups? Or maybe build a "consensus" sequence for each group and then check its pairwise nucleotide difference from the "consensus" sequences of the other groups?
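As a rough illustration of the first suggestion, here is a minimal sketch that compares mean pairwise differences within a group against those between two groups. It assumes unaligned sequences and therefore uses Levenshtein (edit) distance as the "pairwise nucleotide difference"; that choice and the function names are assumptions, not something specified in the comment.

```python
from itertools import combinations, product


def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def mean_within(seqs):
    """Mean edit distance over all unordered pairs inside one group (needs >= 2 sequences)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(seqs_a, seqs_b):
    """Mean edit distance over all pairs drawn from two different groups."""
    pairs = list(product(seqs_a, seqs_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)
```

If the within-group means are clearly smaller than the between-group mean, the groups are separated at the sequence level. With a few thousand sequences of ~100 nt this pure-Python version may be slow, so a compiled edit-distance library could be swapped in.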

ADD REPLY • link 4.2 years ago by manaswwm ▴ 490

Hello, thanks for the comment. I don't have any particular metrics in mind and hoped to get some from this question. I thought about checking for differences in k-mer abundance, GC content, and consensus sequence. Still, maybe users of the forum can think of something different or more specific. Even more helpful would be some kind of tool/script that can check those things and report the statistics to a file.
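The metrics mentioned here (k-mer abundance and GC content) can be computed per group with a short script. The following is a minimal sketch rather than an established tool: the function names, the default k=4, the output columns and the file names in the usage comments are all assumptions made for illustration. For each k-mer it reports the fraction of sequences in each group that contain it, ranked by the difference between groups.

```python
import csv
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def gc_content(seq):
    """Fraction of G and C bases in one sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_gc(seqs):
    """Mean GC fraction over a group of sequences."""
    return sum(gc_content(s) for s in seqs) / len(seqs)


def kmer_presence(seqs, k):
    """Count in how many sequences each k-mer occurs at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def compare_kmers(seqs_a, seqs_b, k=4):
    """Rows of (kmer, fraction in A, fraction in B, difference), largest gap first."""
    a_counts, b_counts = kmer_presence(seqs_a, k), kmer_presence(seqs_b, k)
    rows = []
    for kmer in set(a_counts) | set(b_counts):
        frac_a = a_counts[kmer] / len(seqs_a)
        frac_b = b_counts[kmer] / len(seqs_b)
        rows.append((kmer, frac_a, frac_b, frac_a - frac_b))
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows


def write_tsv(rows, handle):
    """Write the comparison rows to an open text file handle as TSV."""
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["kmer", "fraction_group_a", "fraction_group_b", "difference"])
    writer.writerows(rows)


# Example usage (file names are hypothetical):
# group_a = list(read_fasta(open("group1.fasta").read()).values())
# group_b = list(read_fasta(open("group2.fasta").read()).values())
# with open("kmer_comparison.tsv", "w", newline="") as out:
#     write_tsv(compare_kmers(group_a, group_b, k=4), out)
```

Because many sequences in this dataset are derived from the same original pre-miRNA, they are not independent observations, so large differences from such a table are better treated as candidates to follow up than as statistically tested results.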

ADD REPLY • link 4.2 years ago by artemd ▴ 10

You could use clumpify.sh from the BBMap suite to clump the sequences based on their sequence similarity (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). This will work with both FASTQ and FASTA data.

You could then take representative sequences from the group to build "phylogenetic trees" that can display their potential relationships.

ADD REPLY • link 4.2 years ago by GenoMax 141k

Thanks for the suggestion. I read the description of the tool briefly and it looks like it could be useful for my problem, though I'm not sure exactly how to use it. Let's say I have two FASTA files with 200 sequences each and I want to know whether there are metrics by which the sequences in those two files differ. I run the tool on both files and receive a file where similar sequences are clumped together. How would you suggest I proceed from there?

ADD REPLY • link 4.2 years ago by artemd ▴ 10

score 1 · Answer 1 · 2020-02-17

4.2 years ago

michau ▴ 60

If I understand correctly you could just align the dataset and look for conservation patterns. With Clustal Omega (clustalo --output-order=tree-order) you can sort the output based on sequence similarity, so you would see groups of sequences that align correctly.

Otherwise, you can cluster the sequences with CD-HIT and align the groups separately.

You would likely be looking for a consensus sequence or a sequence logo (when viewing the alignment in Jalview (https://www.jalview.org/), logos and other statistics are shown at the bottom of the alignment).
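For a scriptable version of the consensus idea, here is a minimal sketch. It assumes the sequences have already been aligned (for example with Clustal Omega) so that all rows have equal length, and that both groups come from the same alignment so that column indices correspond. The function names and the 0.5 majority threshold are assumptions for illustration.

```python
from collections import Counter


def column_profile(aligned):
    """Per-column residue counts for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [Counter(s[i] for s in aligned) for i in range(length)]


def consensus(aligned, min_fraction=0.5):
    """Most common residue per column, or 'N' if it falls below min_fraction.

    Ties are resolved alphabetically so the result is deterministic.
    """
    out = []
    for col in column_profile(aligned):
        residue, count = max(sorted(col.items()), key=lambda kv: kv[1])
        out.append(residue if count / len(aligned) >= min_fraction else "N")
    return "".join(out)


def differing_columns(cons_a, cons_b):
    """0-based positions where two consensus strings disagree."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(cons_a, cons_b)) if a != b]
```

Applied to the first 14 nucleotides of AllPreMir_hsa-mir-6791 and its 5p seed-changed variant from the question, `differing_columns("CCAGACCCCTGGGG", "CCAGACCCGACGGG")` reports positions 8, 9 and 10, which are the three changed seed nucleotides.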

ADD COMMENT • link 4.2 years ago by michau ▴ 60

As I understand it, your suggestion is to align the sequences against each other and cluster them based on sequence similarity. I tried the Clustal Omega online aligner with a small sample of sequences. The output looks nice and it really can cluster sequences based on similarity. But I find it hard to see how to apply this tool to my particular problem (I replied the same to @genomax's comment).

Maybe the additional information I added to the original post would help to illustrate my problem better and you could point me to a more specific solution you are familiar with?

Thanks for your help and suggestions thus far.

ADD REPLY • link 4.2 years ago by artemd ▴ 10

score 1 · Answer 2 · 2020-02-17

4.2 years ago

Malcolm.Cook ★ 1.5k

From your description, my guess is that you want to characterize which kinds of modifications (M) result in your original unmodified sequence (US) being assigned the same experimental group (USG) as the group (MSG) experimentally assigned to its modified counterpart (MS). If so, it would help to know whether you already have a way to characterize the nature of the modifications (M). For example, do you already have a little language in which the modifications are defined? If so, can you provide an example 5-column table with columns US, M, MS, USG, MSG? This would help to clarify the nature of the problem. If I am describing your problem correctly, you would then be looking for values of M for which USG and MSG are always (or usually) the same, and values of M for which USG and MSG are always (or usually) different. Further, do the group-preserving or group-changing Ms have something structurally in common?
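A minimal sketch of how such a table could be assembled is shown below. It assumes FASTA headers of the form `<Tag>_<miRNA>[_<detail>]`, as in the example in the question, with "AllPreMir" marking the unmodified sequence, and it assumes a hypothetical `group_of` mapping from header to experimental group. Headers stand in for the sequences in the US and MS columns. None of this is confirmed as the asker's actual naming scheme.

```python
from collections import defaultdict

ORIGINAL_TAG = "AllPreMir"  # assumed tag of unmodified sequences


def parse_header(header):
    """Split '<Tag>_<miRNA>[_<detail>]' into (miRNA, modification label)."""
    parts = header.split("_", 2)
    tag, mirna = parts[0], parts[1]
    detail = parts[2] if len(parts) > 2 else ""
    # Keep the first detail token (e.g. '5p') so 5p and 3p seed changes stay distinct.
    label = tag + ("_" + detail.split("_")[0] if detail else "")
    return mirna, label


def build_table(group_of):
    """Return rows (US, M, MS, USG, MSG) from a {header: group} mapping."""
    original = {}
    for header in group_of:
        mirna, label = parse_header(header)
        if label == ORIGINAL_TAG:
            original[mirna] = header
    rows = []
    for header, msg in group_of.items():
        mirna, label = parse_header(header)
        # Skip the originals themselves and variants whose original has no group.
        if label == ORIGINAL_TAG or mirna not in original:
            continue
        us = original[mirna]
        rows.append((us, label, header, group_of[us], msg))
    return rows


def preservation_by_modification(rows):
    """Fraction of rows per modification M where USG equals MSG."""
    same, total = defaultdict(int), defaultdict(int)
    for _us, m, _ms, usg, msg in rows:
        total[m] += 1
        same[m] += usg == msg
    return {m: same[m] / total[m] for m in total}
```

With the two example groups from the question, only the hsa-mir-18a variants produce rows (the unmodified hsa-mir-6791 is not in either example group). The 5p seed change keeps the group, while the reversed and scrambled variants change it.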

ADD COMMENT • link 4.2 years ago by Malcolm.Cook ★ 1.5k

Hi Malcolm, thanks for your answer. I think your suggestion could be a decent direction to take. Though I think an easier (as in more straight forward) solution would be to find differences between individual sequences not taking into account the original/modified sequences. What do you think?

I added additional information to the original post to make it more understandable.

ADD REPLY • link 4.2 years ago by artemd ▴ 10

Well, my comment/guess intends to better elicit from you what your actual problem is. I read the answers provided so far, and felt they might be aimed at answering a different question than you have. With a clearer problem statement, I, or others, might make a more informed stab at a solution. To me, at this point, it's not really a matter of which is "easier" or "more straight forward" solution. Rather, it is which problem statement asks the scientific questions you seek to answer.

ADD REPLY • link 4.2 years ago by Malcolm.Cook ★ 1.5k

Hi, sorry for the late reply; I hope you still remember the issue we were discussing here. Your answer was spot on: I'm currently doing the analysis you described and I believe it will give us some important insights into our experiments. Since you are probably the one in this thread who understood our problem best, maybe you could suggest a way to perform the analysis from the sequence point of view? What methods could I use to find "differential sequence characteristics" between the different groups (besides simply writing code to compute metrics such as GC%, k-mers and other repeating elements)?

ADD REPLY • link 4.2 years ago by artemd ▴ 10

Hi. Not sure I can help further. If you are able to fully respond to my original "Answer" to your question, I might have some ideas for you.

ADD REPLY • link 4.1 years ago by Malcolm.Cook ★ 1.5k

Oh right, to address your original answer: unfortunately the different modifications don't have any defined characteristics. So currently I can do (and will do) an analysis comparing the modified and unmodified sequences to see whether modification affects the experimental results (and maybe find effects that are general for one type of modification, though I doubt the effect will be that large). As I mentioned previously, I would also like to perform an additional analysis where I disregard the original/modified information and look only at the nucleotides of each sequence.

ADD REPLY • link 4.1 years ago by artemd ▴ 10

score 1 · Answer 3 · 2020-02-18

4.2 years ago

5heikki 11k

Calculate pairwise distances between the sequences and apply affinity propagation.
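The answer does not say which distance to use. As one possible reading, here is a minimal sketch that builds an alignment-free distance matrix from k-mer count profiles using cosine distance. The choice of cosine distance, k=3 and the function names are assumptions, and the clustering step itself is left to an existing implementation.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Counts of every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p, q):
    """1 - cosine similarity of two k-mer count profiles (1.0 if either is empty)."""
    dot = sum(p[kmer] * q[kmer] for kmer in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise cosine distances with a zero diagonal."""
    profiles = [kmer_profile(s, k) for s in seqs]
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(profiles[i], profiles[j])
    return dist
```

Affinity propagation works on similarities rather than distances, so the negated matrix could then be passed to an existing implementation, for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`. The resulting clusters can be compared against the experimental groups to see whether sequence similarity alone explains the grouping.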

ADD COMMENT • link 4.2 years ago by 5heikki 11k

Thanks for your answer. I read some literature on the approach you suggested and I don't think it's applicable to my problem (unless I misunderstood something, which is possible).

I believe the original title of the question about similarities of closely related sequences misrepresented my problem. I'm searching for directions to find what unique characteristics might be present in two groups of DNA sequences (the DNA sequences might be similar to each other).

If I misunderstood something, or if you have a different suggestion, I'll be more than happy to hear about it. Thanks.

ADD REPLY • link 4.2 years ago by artemd ▴ 10