How To Merge Consensus Motif To Degenerate Motif ? [Motif Merge Or Motif Clustering]
2
2
Entering edit mode
9.5 years ago
Yu ▴ 130

Hi all!

I want to merge consensus motif to degenerate motif as below,

– consensus motif

ACGGGTA

GCGGGTC

ACGGGTC

GCGGGTA

– degenerate motif

RCGGGTM

{G|A}CGGGT{A|C}

Can you tell me some algorithm or principle of it?

For example , I have a list of consensus motifs

motif1 ACGGGTA

motif2 GCGGGTC

motif3 ACGGGTC

motif4 GCGGGTA

motif5 AGGGGTC

motif6 GCGGGTT

The degenerate motif can be

method 1.

motif 1-motif4 , motif6 RCGGGTH

motif 5 AGGGGTC

method 2.

motif 2 , motif5 ASGGGTC

motif 1 , motif 3-4 , motif 6 RCGGGTH

Which method is right?

Thanks.

EDIT : My question come from using K-mer to search DNA seqence , I want to clustering consensus motif to degenerate motif like the example above.

motif merge clustering consensus • 5.4k views
0
Entering edit mode

Your edit did not change the question, my solution does produce the degenerative sequence with ambiguity codes as you requested. If you have something else in mind when you say "clustering" you will have to elaborate on that and make it clear (or create a new post if you want something other than what you indicted above).

0
Entering edit mode

"clustering" - I mean "How similar motifs can merged together ?" I'm sorry that my question is not clear. As you wish, I marking the question answered. Thanks for your help.

4
Entering edit mode
9.5 years ago
SES 8.5k

It sounds like you are trying to generate a consensus with IUPAC ambiguity codes. Here is a BioPerl example:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::Seq;
use Bio::AlignIO;
use Bio::Tools::IUPAC;

my $alnio = Bio::AlignIO->new(-fh => \*DATA, -format => 'fasta'); while (my$aln = $alnio->next_aln) { my$cs = $aln->consensus_iupac; my$seq = Bio::Seq->new(-seq => $cs, -alphabet => 'dna'); print "With IUPAC codes:$cs\n";
print "IUPAC codes meaning: ";
my $iupac = Bio::Tools::IUPAC->new(-seq =>$seq);
my %dnasymbols = $iupac->iupac_iub; my @residues = split //,$cs;
for my $r (@residues) { if (exists$dnasymbols{$r}) { if (scalar @{$dnasymbols{$r}} > 1) { print "["; print join "|",@{$dnasymbols{$r}}; print "]"; } else { print @{$dnasymbols{$r}}; } } } print "\n"; } __DATA__ >seq1 ACGGGTA >seq2 GCGGGTC >seq3 ACGGGTC >seq4 GCGGGTA  This produces what you requested in the first part of your question: $ perl biostars66538.pl
With IUPAC codes: RCGGGTM
IUPAC codes meaning: [A|G]CGGGT[A|C]


ETA: I noticed that you wanted the iupac meanings also, so I added that part. There are probably other ways to do this, but a small script will give you something you can easily change to suit your needs. A quick search turned up similar methods in bioruby and biopython, so I would also consider those options if you want to implement something yourself.

0
Entering edit mode

Thanks for you code example! My question come from using K-mer to search DNA seqence , I want to clustering the motif I found.

0
Entering edit mode

You did not mention clustering in your question and it's difficult to provide a good answer as a comment. If the above solution works for you, I recommend you marking the question answered and then create a new post where you can ask about clustering.

0
Entering edit mode

Nice solution, but what about cases where motifs are different length?

1
Entering edit mode

I hadn't thought about that situation because it wasn't part of the question, but if that is something you are interested in I'm sure it could be handled pretty easy. There is a method called something like "gap_col_matrix" in Bio::AlignIO that will allow you to find gaps. If you wanted to treat these sites differently, that is where I'd start.

0
Entering edit mode

with these sequences:

__DATA__
>seq1 ACGGGTA
>seq2 ACGGGTT
>seq3 ACGGGAT


I Obtain :

With IUPAC codes: ACGGGWW
IUPAC codes meaning: ACGGG[A|T][A|T]


While ACGGGAA doesn't exist in the input sequences !

0
Entering edit mode

The output isn't telling you ACGGGAA exists in the input, it is telling you ACGGG[A|T][A|T] exists in the input (see seq1 and seq3) and we are showing the consensus.

1
Entering edit mode
9.5 years ago

You could play around with SLiMMaker to generate the consensus motifs:

http://bioware.soton.ac.uk/slimmaker.html

You would have to change the parameters a bit - I used 1 No. of sequences for aa to be in. 4 Max number of different aa's at a single position. N- and ticked use DNA rather than peptides.

0
Entering edit mode

Thanks, I try SLiMMaker and it can give me the result [AG][CG]GGGT[ACT].

I need a principle for why a motif A and motif B can be merged.

For example, I input all 4-mer motif to SLiMMaker , then SLiMMaker will give me the result [ACGT][ACGT][ACGT][ACGT]

Thats not what I was looking for at all.