How To Merge Consensus Motif To Degenerate Motif ? [Motif Merge Or Motif Clustering]
2
2
Entering edit mode
8.6 years ago
Yu ▴ 130

Hi all!

I want to merge consensus motif to degenerate motif as below,

– consensus motif

ACGGGTA

GCGGGTC

ACGGGTC

GCGGGTA

– degenerate motif

RCGGGTM

{G|A}CGGGT{A|C}

I googled it but didn't find the answer.

Can you tell me some algorithm or principle of it?

For example , I have a list of consensus motifs

motif1 ACGGGTA

motif2 GCGGGTC

motif3 ACGGGTC

motif4 GCGGGTA

motif5 AGGGGTC

motif6 GCGGGTT

The degenerate motif can be

method 1.

motif 1-motif4 , motif6 RCGGGTH

motif 5 AGGGGTC

method 2.

motif 2 , motif5 ASGGGTC

motif 1 , motif 3-4 , motif 6 RCGGGTH

Which method is right?

Thanks.

EDIT : My question come from using K-mer to search DNA seqence , I want to clustering consensus motif to degenerate motif like the example above.

motif merge clustering consensus • 4.8k views
ADD COMMENT
0
Entering edit mode

Your edit did not change the question, my solution does produce the degenerative sequence with ambiguity codes as you requested. If you have something else in mind when you say "clustering" you will have to elaborate on that and make it clear (or create a new post if you want something other than what you indicted above).

ADD REPLY
0
Entering edit mode

"clustering" - I mean "How similar motifs can merged together ?" I'm sorry that my question is not clear. As you wish, I marking the question answered. Thanks for your help.

ADD REPLY
4
Entering edit mode
8.6 years ago
SES 8.5k

It sounds like you are trying to generate a consensus with IUPAC ambiguity codes. Here is a BioPerl example:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::Seq;
use Bio::AlignIO;
use Bio::Tools::IUPAC;

my $alnio = Bio::AlignIO->new(-fh => \*DATA, -format => 'fasta');

while (my $aln = $alnio->next_aln) {
    my $cs = $aln->consensus_iupac;
    my $seq = Bio::Seq->new(-seq => $cs, -alphabet => 'dna');
    print "With IUPAC codes: $cs\n";
    print "IUPAC codes meaning: ";
    my $iupac  = Bio::Tools::IUPAC->new(-seq => $seq);
    my %dnasymbols = $iupac->iupac_iub;
    my @residues = split //, $cs;
    for my $r (@residues) {
        if (exists $dnasymbols{$r}) {
            if (scalar @{$dnasymbols{$r}} > 1) {
                print "[";
                print join "|",@{$dnasymbols{$r}};
                print "]";
            }
            else {
                print @{$dnasymbols{$r}};
            } 
        }
    }
    print "\n";
}

__DATA__
>seq1
ACGGGTA 
>seq2
GCGGGTC
>seq3
ACGGGTC
>seq4
GCGGGTA

This produces what you requested in the first part of your question:

$ perl biostars66538.pl
With IUPAC codes: RCGGGTM
IUPAC codes meaning: [A|G]CGGGT[A|C]

I could not follow the last part of your post, so please elaborate if there was more to your question.

ETA: I noticed that you wanted the iupac meanings also, so I added that part. There are probably other ways to do this, but a small script will give you something you can easily change to suit your needs. A quick search turned up similar methods in bioruby and biopython, so I would also consider those options if you want to implement something yourself.

ADD COMMENT
0
Entering edit mode

Thanks for you code example! My question come from using K-mer to search DNA seqence , I want to clustering the motif I found.

ADD REPLY
0
Entering edit mode

You did not mention clustering in your question and it's difficult to provide a good answer as a comment. If the above solution works for you, I recommend you marking the question answered and then create a new post where you can ask about clustering.

ADD REPLY
0
Entering edit mode

Nice solution, but what about cases where motifs are different length?

ADD REPLY
1
Entering edit mode

I hadn't thought about that situation because it wasn't part of the question, but if that is something you are interested in I'm sure it could be handled pretty easy. There is a method called something like "gap_col_matrix" in Bio::AlignIO that will allow you to find gaps. If you wanted to treat these sites differently, that is where I'd start.

ADD REPLY
0
Entering edit mode

with these sequences:

__DATA__
>seq1 ACGGGTA 
>seq2 ACGGGTT
>seq3 ACGGGAT

I Obtain :

With IUPAC codes: ACGGGWW
IUPAC codes meaning: ACGGG[A|T][A|T]

While ACGGGAA doesn't exist in the input sequences !

ADD REPLY
0
Entering edit mode

The output isn't telling you ACGGGAA exists in the input, it is telling you ACGGG[A|T][A|T] exists in the input (see seq1 and seq3) and we are showing the consensus.

ADD REPLY
1
Entering edit mode
8.6 years ago

You could play around with SLiMMaker to generate the consensus motifs:

http://bioware.soton.ac.uk/slimmaker.html

You would have to change the parameters a bit - I used 1 No. of sequences for aa to be in. 4 Max number of different aa's at a single position. N- and ticked use DNA rather than peptides.

ADD COMMENT
0
Entering edit mode

Thanks, I try SLiMMaker and it can give me the result [AG][CG]GGGT[ACT].

I need a principle for why a motif A and motif B can be merged.

For example, I input all 4-mer motif to SLiMMaker , then SLiMMaker will give me the result [ACGT][ACGT][ACGT][ACGT]

Thats not what I was looking for at all.

ADD REPLY

Login before adding your answer.

Traffic: 1838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6