Question

Extracting four fold degenerate sites from nucleotide alignment

0

Entering edit mode

8.6 years ago

A • 0

Hi all and python users,

I am looking for a script written in python or a feature in Biopython or guidance for writing a script in python that would extract four fold degenerate sites from a nucleotide alignment file of about 90 nucleotide sequences. The output file should have four fold degenerate sites per sequence which I want to use to make a new alignment and build phylogeny.

Thinking that it would a simple thing to do, I tried to search (google) for it but very strangely, I am not finding right answer. Please help me get in the right direction.

Very many thanks,

A

alignment • 7.6k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by A • 0

Ram · Answer 1 · 2015-09-11

This should get you started. I borrowed most of the code from http://www.petercollingridge.co.uk/python-bioinformatics-tools/codon-table

bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def translate(seq):
    seq = seq.lower().replace('\n', '').replace(' ', '')
    peptide = ''

    for I in xrange(0, len(seq), 3):
        codon = seq[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
        else:
            break

    return peptide

codon_dict={}
for aa in set(amino_acids):
    temp=[]
    for c in codons:
       if translate(c)==aa:
           temp.append(c)
    if len(temp)==4:
        codon_dict[aa]=temp

print codon_dict

Now you should be able to run through your alignment and check whether the codon is in the codon_dict...

Ram · Answer 2 · 2015-09-10

I was looking for something along these lines late last year. I came up short, and (to my knowledge) there still isn't anything available. I attribute this to 1) a lack of demand and 2) a perceived difficulty in programming the task.

But, it's easy enough that I bet everyone who has written a script just didn't make it public.

If you want to use Python, simply create a dictionary based on the translation table of your organism of interest and the degeneracy (0, 1, 2, 3, 4-fold) for each position given a codon. Then, scan across your nucleotides in triplets and identify the degeneracy from the table. It's just a bit of a pain to set up because you have to type it all in, but it shouldn't take much time. This could be done in any scripting language and is basically just string processing.

Sorry for no solution directly, but I hope this points you in the right direction.

Ram · Answer 3 · 2015-09-11

1

Entering edit mode

8.6 years ago

h.mon 35k

There is a script in Perl provided by thackl here. However, as I commented before, if you have a large number of ambiguous sites, the number of sequences and file size may grow very quickly.

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by h.mon 35k