Question: Extracting four fold degenerate sites from nucleotide alignment
gravatar for A
4.5 years ago by
United Kingdom
A0 wrote:

Hi all and python users,

I am looking for a script written in python or a feature in Biopython or guidance for writing a script in python that would extract four fold degenerate sites from a nucleotide alignment file of about 90 nucleotide sequences. The output file should have four fold degenerate sites per sequence which I want to use to make a new alignment and build phylogeny.

Thinking that it would a simple thing to do, I tried to search (google) for it but very strangely, I am not finding right answer. Please help me get in the right direction.

Very many thanks,


alignment • 4.2k views
ADD COMMENTlink modified 4.5 years ago by h.mon29k • written 4.5 years ago by A0
gravatar for Brice Sarver
4.5 years ago by
Brice Sarver3.5k
United States
Brice Sarver3.5k wrote:

I was looking for something along these lines late last year. I came up short, and (to my knowledge) there still isn't anything available. I attribute this to 1) a lack of demand and 2) a perceived difficulty in programming the task.

But, it's easy enough that I bet everyone who has written a script just didn't make it public.

If you want to use Python, simply create a dictionary based on the translation table of your organism of interest and the degeneracy (0, 1, 2, 3, 4-fold) for each position given a codon. Then, scan across your nucleotides in triplets and identify the degeneracy from the table. It's just a bit of a pain to set up because you have to type it all in, but it shouldn't take much time. This could be done in any scripting language and is basically just string processing.

Sorry for no solution directly, but I hope this points you in the right direction.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Brice Sarver3.5k
gravatar for Whetting
4.5 years ago by
Bethesda, MD
Whetting1.5k wrote:

This should get you started. I borrowed most of the code from ( )


bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
codon_table = dict(zip(codons, amino_acids))

def translate(seq):
    seq = seq.lower().replace('\n', '').replace(' ', '')
    peptide = ''
    for i in xrange(0, len(seq), 3):
        codon = seq[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
    return peptide

for aa in set(amino_acids):
    for c in codons:
       if translate(c)==aa:
    if len(temp)==4:
print codon_dict



Now you should be able to run through your alignment and check whether the codon is in the codon_dict...

ADD COMMENTlink written 4.5 years ago by Whetting1.5k

Many thanks for getting me started with the code. I'll give it a go and will let you know if successful!!

ADD REPLYlink written 4.4 years ago by A0
gravatar for h.mon
4.5 years ago by
h.mon29k wrote:

There is a script in Perl provided by thackl here. However, as I commented before, if you have a large number of ambiguous sites, the number of sequences and file size may grow very quickly.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by h.mon29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 795 users visited in the last hour