Is There Any Script To Count Repeats In Protein Sequences?
13.0 years ago

Hi, there are several scripts to count simple sequence repeats (SSRs) in DNA sequences (like MISA). Is there any script (in Perl or Python) like MISA that calculates repeats in protein sequences? For example, I want to know how many amino acid repeats (2, 3, 4 and 5) there are in a set of protein sequences. Thanks a lot for any help.

perl bioperl python • 5.7k views
13.0 years ago
benjwoodcroft ▴ 170

Is it necessary that the repeats have to be perfect? Maybe you could try something like XSTREAM? I realise it isn't Perl or Python, but maybe you could parse the output somehow? It might be better to build on other people's work than write something from scratch, though you might disagree in this instance.

13.0 years ago

For the DNA aspect of your question, I would use RepeatMasker. I searched for "RepeatMasker parser" and found this tool under bioperl. As an example, BLASTP uses filters that mask low-complexity protein sequence. I find protein repeats more difficult and challenging to define. For example, is it a repeat if a large protein domain is present in more than one copy in a protein sequence? It might be worth running a dot-plot analysis of the sequence against itself to see the repeated elements of a protein sequence.
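
To make the self-against-self dot-plot idea concrete, here is a minimal sketch in plain Python. It is not taken from any existing tool: the function names `self_dotplot` and `diagonal_counts` and the `window` parameter are made up for illustration, and it only scores exact word matches (no substitution matrix). A dot is a pair of positions whose words are identical; a duplicated segment shows up as many dots sharing the same offset, i.e. lying on the same diagonal.

```python
from collections import Counter


def self_dotplot(seq, window=3):
    """Return sorted (i, j) pairs, i < j, where the words of length
    `window` starting at i and at j are identical (0-based positions)."""
    positions = {}
    for i in range(len(seq) - window + 1):
        positions.setdefault(seq[i:i + window], []).append(i)

    dots = []
    for starts in positions.values():
        # Every pair of occurrences of the same word is one dot.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


def diagonal_counts(dots):
    """Count dots per diagonal (offset j - i)."""
    return Counter(j - i for i, j in dots)


# Toy sequence: the segment ACDEFG occurs at positions 0 and 9.
dots = self_dotplot("ACDEFGHKLACDEFGMN", window=3)
print(dots)                   # [(0, 9), (1, 10), (2, 11), (3, 12)]
print(diagonal_counts(dots))  # Counter({9: 4})
```

For real proteins, where repeated domains are rarely identical, an exact-match plot like this will miss diverged copies, so treat it as a first look rather than a domain finder.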

13.0 years ago
Sequencegeek ▴ 740

The script below should work. Use the countRepeat function with these arguments:

  • seq: your protein (amino acid) sequence

  • minTimes: the minimum number of times a kmer must repeat consecutively before it is counted as a repeat

  • kmerLength: the length of the repeat unit (1, 2, 3, 4, 5, etc.)

Note: this will find repeats in every slide (offset, kind of like a reading frame) of the sequence. If you only want a specific slide, you'll have to modify the script.

Feel free to ask for any clarifications,
Good Luck!

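def returnFrames(sequence, frameLength):
    '''Given a sequence and length of frame, return a list of all overlapping frames from sequence'''

    if frameLength > len(sequence):
        print('the length of the frame is larger than the sequence itself!!!')
        return []  # empty list, so callers can still iterate over the result

    frames = []
    i = 0

    while (i + frameLength) < (len(sequence) + 1):
        frames.append(sequence[i:(i + frameLength)])
        i = i + 1

    return frames

def countRepeat(seq, minTimes, kmerLength):
    '''Print "kmer slide count" for every stretch of at least minTimes consecutive copies of a kmer'''

    # get all kmers in the sequence, uniquify (sorted for a stable output order)
    kSeqs = sorted(set(returnFrames(seq, kmerLength)))

    for kmer in kSeqs:
        # slides: every possible starting offset, like reading frames
        for slide in range(0, kmerLength):

            sLen = 0
            for i in range(slide, len(seq), kmerLength):
                if seq[i:i + kmerLength] == kmer:
                    sLen += 1
                else:
                    # the stretch ended: report it if long enough, then start over
                    if sLen >= minTimes:
                        print(kmer, slide, sLen)
                    sLen = 0

            # report a stretch that runs up to the end of the sequence
            # (slicing never raises IndexError, so this must be checked explicitly)
            if sLen >= minTimes:
                print(kmer, slide, sLen)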
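
To illustrate what the "slides" in the note above mean, here is a small self-contained sketch of the same idea written with `itertools.groupby`. It is an alternative formulation, not the script above: the name `tandem_runs` and the returned tuple layout are assumptions made for this example. For each slide the sequence is chopped into non-overlapping k-mers, and runs of identical neighbouring k-mers are collected.

```python
from itertools import groupby


def tandem_runs(seq, kmer_length, min_times):
    """Return (kmer, slide, start, copies) for each run of at least
    `min_times` identical adjacent k-mers; `start` is 0-based."""
    hits = []
    for slide in range(kmer_length):
        # Chop the sequence into non-overlapping k-mers from this offset.
        chunks = [seq[i:i + kmer_length]
                  for i in range(slide, len(seq), kmer_length)]
        pos = slide
        for kmer, group in groupby(chunks):
            copies = len(list(group))
            # Skip the short leftover chunk at the end of the sequence.
            if len(kmer) == kmer_length and copies >= min_times:
                hits.append((kmer, slide, pos, copies))
            pos += copies * kmer_length
    return hits


for hit in tandem_runs("MAQQQQGSGSGSK", 2, 2):
    print(hit)
# ('QQ', 0, 2, 2)
# ('GS', 0, 6, 3)
# ('SG', 1, 7, 2)
```

Note how the GS repeat found in slide 0 shows up again in slide 1 as a shorter SG repeat. That is the effect of scanning all slides, so you may want to keep only the longest hit per region.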

Thanks for your answer. I saved your script as an XXX.py file, put my FASTA file in the same folder and ran the script, but it doesn't work. I don't know what my mistake is (I am a bit of a freshman in bioinformatics). Also, I have several sequences; does it work for multiple sequences? In fact, I need an output with the number of repeats and the type of repeats for each sequence separately, plus the names of the sequences that have repeats (like the MISA script). Thanks again for your help. Regards
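
For the multi-sequence, MISA-like output asked for here, a rough sketch could look like the following. Everything in it is an assumption made for illustration, not part of the script above or of MISA: the helper names `read_fasta`, `find_repeats` and `report`, the choice to read "2, 3, 4 and 5" as repeat unit lengths, the regex-based search for perfect tandem repeats only, and the file name in the final comment. It needs nothing beyond the standard library.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            fields = line[1:].split()
            name, parts = (fields[0] if fields else ""), []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def find_repeats(seq, unit_len, min_copies=2):
    """Return (unit, start, copies) for perfect tandem repeats whose unit
    is `unit_len` residues long; `start` is 1-based."""
    pattern = re.compile(r"([A-Za-z]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [(m.group(1), m.start() + 1, len(m.group(0)) // unit_len)
            for m in pattern.finditer(seq)]


def report(lines, unit_lengths=(2, 3, 4, 5), min_copies=2):
    """Return rows of (sequence name, unit length, unit, copies, start)."""
    rows = []
    for name, seq in read_fasta(lines):
        for k in unit_lengths:
            for unit, start, copies in find_repeats(seq, k, min_copies):
                rows.append((name, k, unit, copies, start))
    return rows


example = [">seq1 toy protein", "MAQQQQGSGSGSK", ">seq2", "MKTAYIAK"]
for row in report(example, unit_lengths=(1, 2)):
    print("\t".join(str(field) for field in row))
# seq1	1	Q	4	3
# seq1	2	QQ	2	3
# seq1	2	GS	3	7

# With a real file (hypothetical name):
# with open("proteins.fasta") as handle:
#     rows = report(handle)
```

Only sequences that contain a repeat appear in the output, which gives the list of sequence names with repeats. A homopolymer such as QQQQ is also reported as a two-residue unit (QQ x 2), so you may want to filter out units that are themselves made of a shorter repeat.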
