Question

Python: Optimizing my amino acid % composition code

1

Entering edit mode

8.2 years ago

michaelrw10 ▴ 10

I know that it is a big annoyance when people come on here to get homework answers. While this is for a homework assignment, you can rest assured that I am not looking for a handout or straight answers; I am merely looking for help. Also, the assignment is actually completed, and I am just looking for some help in optimizing the code and/or reworking it using a different approach.

The first assignment was to use a codon table dictionary to translate a provided DNA string into 3-letter amino acid sequence. This part has already been submitted. The next assignment asked us to then take that sequence and determine the relative percent composition of each aa in the polypeptide, while also utilizing a dictionary that will convert the 3-letter sequence into a single-letter sequence. So, as you can see, I converted the 3-letter into a 1-letter string. From there I did a series of count variables, one for each AA. each was then divided by the total AA in the chain. So, I have accomplished the assignment. That being said, I feel that there may be an easier way of doing this without having to repeat 20 lines of nearly-identical code for each of the 20 amino acids. I thought maybe there would be a way to use a for/while loop that could help reduce some of the code, but I just don't know.

Also, fwiw, my undergrad work was done in biology and chemistry. I am a first-year grad student studying molecular pharmacology, and this bioinfo python course is the first programming I have ever done. Any thoughts, ideas, suggestions would be greatly appreciated. Also, I am limited to default python, biopython will be introduced later in the term

#3-letter aa dictionary
codon_dict = {
#             Second Base
#    U              C             A             G
# U
'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys', # UxU
'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys', # UxC
'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---', # UxA
'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp', # UxG
# C
'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg', # CxU
'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg', # CxC
'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg', # CxA
'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg', # CxG
# A
'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser', # AxU
'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser', # AxC
'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg', # AxA
'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg', # AxG
# G
'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly', # GxU
'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly', # GxC
'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly', # GxA
'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'  # GxG
}
#single-letter aa dictionary
singleletter_dict = {'Cys': 'C', 'Asp': 'D', 'Ser': 'S', 'Gln': 'Q', 'Lys': 'K', 'Trp': 'W', 'Thr': 'T', 'Asn': 'N', 'Pro': 'P', 'Phe': 'F', 'Ala': 'A', 'Gly': 'G', 'Ile': 'I', 'Leu': 'L', 'His': 'H', 'Arg': 'R', 'Met': 'M', 'Val': 'V', 'Glu': 'E', 'Tyr': 'Y', '---': '*'}

#original cDNA sequence
DNA_string = "ATGGCGACTGTCGAACCGGAAACCACCCCTACTCCTAATCCCCCGACTACAGAAGAGGAGAAAACGGAATCTAATCAGGAGGTTGCTAACCCAGAACACTATATTAAACATCCCCTACAGAACAGATGGGCACTCTGGTTTTTTAAAAATGATAAAAGCAAAACTTGGCAAGCAAACCTGCGGCTGATCTCCAAGTTTGATACTGTTGAAGACTTTTGGGCTCTGTACAACCATATCCAGTTGTCTAGTAATTTAATGCCTGGCTGTGACTACTCACTTTTTAAGGATGGTATTGAGCCTATGTGGGAAGATGAGAAAAACAAACGGGGAGGACGATGGCTAATTACATTGAACAAACAGCAGAGACGAAGTGACCTCGATCGCTTTTGGCTAGAGACACTTCTGTGCCTTATTGGAGAATCTTTTGATGACTACAGTGATGATGTATGTGGCGCTGTTGTTAATGTTAGAGCTAAAGGTGATAAGATAGCAATATGGACTACTGAATGTGAAAACAGAGAAGCTGTTACACATATAGGGAGGGTATACAAGGAAAGGTTAGGACTTCCTCCAAAGATAGTGATTGGTTATCAGTCCCACGCAGACACAGCTACTAAGAGCGGCTCCACCACTAAAAATAGGTTTGTTGTTTAA"

RNA_string = DNA_string.replace('T', 'U')
print "This is the RNA that will be translated into a protein: " + RNA_string + '\n'    # to demonstrate that T was replaced with U so the codon table will work

threeletterstring=''
aminoacid = ''
for I in range( 0, len(RNA_string), 3 ):
        codon = RNA_string[i:i+3]
        aminoacid = codon_dict[codon]
        threeletterstring += aminoacid
print "This is the translated protein, designated by the 3-letter amino acid abbreviations: " + threeletterstring + '\n'

# ===================================================================================== #
# ================ PART 2 BEGINS ================ PART 2 BEGINS =============== #
# ===================================================================================== #

#converting the above 3-letter polypeptide string into a single-letter polypeptide string
singleletterstring=''

### my original for loop before consolidating ###
'''
for I in range( 0, len(threeletterstring), 3 ):
    threeletteraa = threeletterstring[i:i+3]
    singleletteraa = singleletter_dict[threeletteraa]
    singleletterstring = singleletterstring + singleletteraa
print singleletterstring
'''

### my new for loop after consolidating ###
for I in range( 0, len(threeletterstring), 3 ):
    singleletterstring += singleletter_dict[threeletterstring[i:i+3]]
print "This is the translated protein, designated by the 1-letter amino acid abbreviations: " + singleletterstring + '\n'

print "This is the relative composition of each amino acid in the translated polypeptide:"
countC = singleletterstring.count('C')
countD = singleletterstring.count('D')
countS = singleletterstring.count('S')
countQ = singleletterstring.count('Q')
countK = singleletterstring.count('K')
countW = singleletterstring.count('W')
countT = singleletterstring.count('T')
countN = singleletterstring.count('N')
countP = singleletterstring.count('P')
countF = singleletterstring.count('F')
countA = singleletterstring.count('A')
countG = singleletterstring.count('G')
countI = singleletterstring.count('I')
countL = singleletterstring.count('L')
countH = singleletterstring.count('H')
countR = singleletterstring.count('R')
countM = singleletterstring.count('M')
countV = singleletterstring.count('V')
countE = singleletterstring.count('E')
countY = singleletterstring.count('Y')

countSTOP = singleletterstring.count('*')

totalaa = float(len(singleletterstring))

print "the relative C composition is " + str(countC/totalaa*100) + "%"
print "the relative D composition is " + str(countD/totalaa*100) + "%"
print "the relative S composition is " + str(countS/totalaa*100) + "%"
print "the relative Q composition is " + str(countQ/totalaa*100) + "%"
print "the relative K composition is " + str(countK/totalaa*100) + "%"
print "the relative W composition is " + str(countW/totalaa*100) + "%"
print "the relative T composition is " + str(countT/totalaa*100) + "%"
print "the relative N composition is " + str(countN/totalaa*100) + "%"
print "the relative P composition is " + str(countP/totalaa*100) + "%"
print "the relative F composition is " + str(countF/totalaa*100) + "%"
print "the relative A composition is " + str(countA/totalaa*100) + "%"
print "the relative G composition is " + str(countG/totalaa*100) + "%"
print "the relative I composition is " + str(countI/totalaa*100) + "%"
print "the relative L composition is " + str(countL/totalaa*100) + "%"
print "the relative H composition is " + str(countH/totalaa*100) + "%"
print "the relative R composition is " + str(countR/totalaa*100) + "%"
print "the relative M composition is " + str(countM/totalaa*100) + "%"
print "the relative V composition is " + str(countV/totalaa*100) + "%"
print "the relative E composition is " + str(countE/totalaa*100) + "%"
print "the relative Y composition is " + str(countY/totalaa*100) + "%"
print "there is/are " + str(countSTOP) + " STOP codons"

python • 5.3k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by michaelrw10 ▴ 10

0

Entering edit mode

I'd begin by abstracting the counting to a new method. That way, the code is written/modified once, but called as many times as required. Pass the amino acid as a parameter to this method, and use it to output count. Also, you could create a string of all unique amino acids and iterate through it with a loop - that would make the code easier to maintain as well.

EDIT: I'd like to add that this is beautiful code. I'd never imagine that this was written by a beginner - at least documentation and variable naming wise. Logic-wise, we always learn something new. Lambdas might be of use at multiple places here.

ADD REPLY • link 8.2 years ago by Ram 43k

0

Entering edit mode

I appreciate the reply, and thank you for the compliment. I may have under-emphasized my experience with 'programming', but I didn't feel like getting into the details because I feel my experience is negligible, by some standards. This is, indeed, my first formal training. Previously I have worked my way through writing some basic scripts in a language called Papyrus that is used for the video game Skyrim. I also have tweaked existing HTML and javascript widgets that I use on my smartphone.. never wrote one from scratch, though. These things were done only as a hobby, but it did give me a good introduction to some of the fundamentals. It also taught me that, in order to better understand the flow of a particular code, i like to use variable names that make sense to me, and also try to keep everything very organized.

At any rate, I'm not sure I know how exactly I could accomplish 'abstracting the counting to a new method'. I do understand the purpose of it though, which is in line with my original motivation for writing this post (trying to simplify the code; remove redundant/unnecessary/repetitive code).

With respect to creating a unique aa string (a 20-character string that represents each of the 20 aa).. how could one then use that to compare other aa strings?

I appreciate the help, thank you

-MW

ADD REPLY • link 8.2 years ago by michaelrw10 ▴ 10

0

Entering edit mode

You're very welcome - I should have used the word "extract", not "abstract" there - when you face a repetitive task that involves a bunch of code executing multiple times with uniform, predictable parts changing in each execution, you can create a generic task (as a method) and pass in the item that changes as a parameter. As Felix has shown below, you can use loops, or you can pass in the AAs as parameters and create a method that defines the task just once - this is typically useful for longer, more complicated tasks.

This makes code more maintainable. You end up changing fewer lines of code per change in logic/requirement and the change remains consistent because partial changes (such as changes that were made on just 19 of the 20 lines by mistake) would be invalid and would not work.

Modifying Felix's code, assuming you did not need to store the AA counts, you could just do this:

for aa in singleletter_dict.values():
    print "the relative {letter} composition is {composition}%".format(letter=aa,              composition=float(singleletterstring.count(aa)/len(singleletterstring)*100)

Or if you wanted cleaner code, you could still compute the counts, the overall count and the composition separately, and use them in the same code block, as shown below:

count_all=len(singleletterstring)
for aa in singleletter_dict.values():
    count_aa=singleletterstring.count(aa)
    composition=float(count_aa)/count_all*100
    print "the relative {letter} composition is {composition}%".format(letter=aa, composition=composition)

ADD REPLY • link 4.3 years ago by Ram 43k

Ram · Answer 1 · 2016-02-03

I would substitute all the individual amino acid processing lines with loops or functions (as suggested by Ram).

countC = singleletterstring.count('C')
countD = singleletterstring.count('D')
countS = singleletterstring.count('S')
countQ = singleletterstring.count('Q')
countK = singleletterstring.count('K')
countW = singleletterstring.count('W')
countT = singleletterstring.count('T')
countN = singleletterstring.count('N')
countP = singleletterstring.count('P')
countF = singleletterstring.count('F')
countA = singleletterstring.count('A')
countG = singleletterstring.count('G')
countI = singleletterstring.count('I')
countL = singleletterstring.count('L')
countH = singleletterstring.count('H')
countR = singleletterstring.count('R')
countM = singleletterstring.count('M')
countV = singleletterstring.count('V')
countE = singleletterstring.count('E')
countY = singleletterstring.count('Y')

countSTOP = singleletterstring.count('*')

totalaa = float(len(singleletterstring))

This can be replaced by something like the following.

counts = {}
for aa in singleletter_dict.values():
    counts[aa] = singleletterstring.count(aa)
count_all = len(singleletterstring)

The second part are all the print statements. Generally, this is not a problem but a nightmare to maintain. Imagine you want to change this statement in the future and you have to edit each line individually.

print "the relative C composition is " + str(countC/totalaa*100) + "%"
print "the relative D composition is " + str(countD/totalaa*100) + "%"
print "the relative S composition is " + str(countS/totalaa*100) + "%"
print "the relative Q composition is " + str(countQ/totalaa*100) + "%"
print "the relative K composition is " + str(countK/totalaa*100) + "%"
print "the relative W composition is " + str(countW/totalaa*100) + "%"
print "the relative T composition is " + str(countT/totalaa*100) + "%"
print "the relative N composition is " + str(countN/totalaa*100) + "%"
print "the relative P composition is " + str(countP/totalaa*100) + "%"
print "the relative F composition is " + str(countF/totalaa*100) + "%"
print "the relative A composition is " + str(countA/totalaa*100) + "%"
print "the relative G composition is " + str(countG/totalaa*100) + "%"
print "the relative I composition is " + str(countI/totalaa*100) + "%"
print "the relative L composition is " + str(countL/totalaa*100) + "%"
print "the relative H composition is " + str(countH/totalaa*100) + "%"
print "the relative R composition is " + str(countR/totalaa*100) + "%"
print "the relative M composition is " + str(countM/totalaa*100) + "%"
print "the relative V composition is " + str(countV/totalaa*100) + "%"
print "the relative E composition is " + str(countE/totalaa*100) + "%"
print "the relative Y composition is " + str(countY/totalaa*100) + "%"
print "there is/are " + str(countSTOP) + " STOP codons"

You can replace this whole block with the following:

for aa, count in sorted(counts.iteritems()):
    if aa=="*":
        print "there is/are {count} STOP codons".format(count=count)
    else:
        composition = float(count)/count_all*100
        print "the relative {letter} composition is {composition}%".format(letter=aa,
                                                                           composition=composition)

In both cases you replace repetitive blocks of code with a loop structure. This allows you to make changes easily in the future by simply editing a single line. If you do ever require some complexity, you could always implement if-else statements.