Question: Python: Optimizing my amino acid % composition code
1
4.8 years ago by
michaelrw1010
michaelrw1010 wrote:

I know that it is a big annoyance when people come on here to get homework answers. While this is for a homework assignment, you can rest assured that I am not looking for a handout or straight answers; I am merely looking for help. Also, the assignment is actually completed, and I am just looking for some help in optimizing the code and/or reworking it using a different approach.

The first assignment was to use a codon table dictionary to translate a provided DNA string into 3-letter amino acid sequence. This part has already been submitted. The next assignment asked us to then take that sequence and determine the relative percent composition of each aa in the polypeptide, while also utilizing a dictionary that will convert the 3-letter sequence into a single-letter sequence. So, as you can see, I converted the 3-letter into a 1-letter string. From there i did a series of count variables, one for each AA. each was then divided by the total AA in the chain. So, I have accomplished the assignment. That being said, I feel that there may be an easier way of doing this without having to repeat 20 lines of nearly-identical code for each of the 20 amino acids. I thought maybe there would be a way to use a for/while loop that could help reduce some of the code, but I just dont know.

Also, fwiw, my undergrad work was done in biology and chemistry. I am a first-year grad student studying molecular pharmacology, and this bioinfo python course is the first programming I have ever done. Any thoughts, ideas, suggestions would be greatly appreciated. Also, I am limited to default python, biopython will be introduced later in the term

#3-letter aa dictionary
codon_dict = {
#             Second Base
#    U              C             A             G
# U
'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys', # UxU
'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys', # UxC
'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---', # UxA
'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp', # UxG
# C
'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg', # CxU
'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg', # CxC
'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg', # CxA
'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg', # CxG
# A
'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser', # AxU
'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser', # AxC
'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg', # AxA
'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg', # AxG
# G
'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly', # GxU
'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly', # GxC
'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly', # GxA
'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'  # GxG
}
#single-letter aa dictionary
singleletter_dict = {'Cys': 'C', 'Asp': 'D', 'Ser': 'S', 'Gln': 'Q', 'Lys': 'K', 'Trp': 'W', 'Thr': 'T', 'Asn': 'N', 'Pro': 'P', 'Phe': 'F', 'Ala': 'A', 'Gly': 'G', 'Ile': 'I', 'Leu': 'L', 'His': 'H', 'Arg': 'R', 'Met': 'M', 'Val': 'V', 'Glu': 'E', 'Tyr': 'Y', '---': '*'}

#original cDNA sequence
DNA_string = "ATGGCGACTGTCGAACCGGAAACCACCCCTACTCCTAATCCCCCGACTACAGAAGAGGAGAAAACGGAATCTAATCAGGAGGTTGCTAACCCAGAACACTATATTAAACATCCCCTACAGAACAGATGGGCACTCTGGTTTTTTAAAAATGATAAAAGCAAAACTTGGCAAGCAAACCTGCGGCTGATCTCCAAGTTTGATACTGTTGAAGACTTTTGGGCTCTGTACAACCATATCCAGTTGTCTAGTAATTTAATGCCTGGCTGTGACTACTCACTTTTTAAGGATGGTATTGAGCCTATGTGGGAAGATGAGAAAAACAAACGGGGAGGACGATGGCTAATTACATTGAACAAACAGCAGAGACGAAGTGACCTCGATCGCTTTTGGCTAGAGACACTTCTGTGCCTTATTGGAGAATCTTTTGATGACTACAGTGATGATGTATGTGGCGCTGTTGTTAATGTTAGAGCTAAAGGTGATAAGATAGCAATATGGACTACTGAATGTGAAAACAGAGAAGCTGTTACACATATAGGGAGGGTATACAAGGAAAGGTTAGGACTTCCTCCAAAGATAGTGATTGGTTATCAGTCCCACGCAGACACAGCTACTAAGAGCGGCTCCACCACTAAAAATAGGTTTGTTGTTTAA"

RNA_string = DNA_string.replace('T', 'U')
print "This is the RNA that will be translated into a protein: " + RNA_string + '\n'    # to demonstrate that T was replaced with U so the codon table will work

threeletterstring=''
aminoacid = ''
for i in range( 0, len(RNA_string), 3 ):
codon = RNA_string[i:i+3]
aminoacid = codon_dict[codon]
threeletterstring += aminoacid
print "This is the translated protein, designated by the 3-letter amino acid abbreviations: " + threeletterstring + '\n'

# ===================================================================================== #
# ================ PART 2 BEGINS ================ PART 2 BEGINS =============== #
# ===================================================================================== #

#converting the above 3-letter polypeptide string into a single-letter polypeptide string
singleletterstring=''

### my original for loop before consolidating ###
'''
for i in range( 0, len(threeletterstring), 3 ):
threeletteraa = threeletterstring[i:i+3]
singleletteraa = singleletter_dict[threeletteraa]
singleletterstring = singleletterstring + singleletteraa
print singleletterstring
'''

### my new for loop after consolidating ###
for i in range( 0, len(threeletterstring), 3 ):
singleletterstring += singleletter_dict[threeletterstring[i:i+3]]
print "This is the translated protein, designated by the 1-letter amino acid abbreviations: " + singleletterstring + '\n'

print "This is the relative composition of each amino acid in the translated polypeptide:"
countC = singleletterstring.count('C')
countD = singleletterstring.count('D')
countS = singleletterstring.count('S')
countQ = singleletterstring.count('Q')
countK = singleletterstring.count('K')
countW = singleletterstring.count('W')
countT = singleletterstring.count('T')
countN = singleletterstring.count('N')
countP = singleletterstring.count('P')
countF = singleletterstring.count('F')
countA = singleletterstring.count('A')
countG = singleletterstring.count('G')
countI = singleletterstring.count('I')
countL = singleletterstring.count('L')
countH = singleletterstring.count('H')
countR = singleletterstring.count('R')
countM = singleletterstring.count('M')
countV = singleletterstring.count('V')
countE = singleletterstring.count('E')
countY = singleletterstring.count('Y')

countSTOP = singleletterstring.count('*')

totalaa = float(len(singleletterstring))

print "the relative C composition is " + str(countC/totalaa*100) + "%"
print "the relative D composition is " + str(countD/totalaa*100) + "%"
print "the relative S composition is " + str(countS/totalaa*100) + "%"
print "the relative Q composition is " + str(countQ/totalaa*100) + "%"
print "the relative K composition is " + str(countK/totalaa*100) + "%"
print "the relative W composition is " + str(countW/totalaa*100) + "%"
print "the relative T composition is " + str(countT/totalaa*100) + "%"
print "the relative N composition is " + str(countN/totalaa*100) + "%"
print "the relative P composition is " + str(countP/totalaa*100) + "%"
print "the relative F composition is " + str(countF/totalaa*100) + "%"
print "the relative A composition is " + str(countA/totalaa*100) + "%"
print "the relative G composition is " + str(countG/totalaa*100) + "%"
print "the relative I composition is " + str(countI/totalaa*100) + "%"
print "the relative L composition is " + str(countL/totalaa*100) + "%"
print "the relative H composition is " + str(countH/totalaa*100) + "%"
print "the relative R composition is " + str(countR/totalaa*100) + "%"
print "the relative M composition is " + str(countM/totalaa*100) + "%"
print "the relative V composition is " + str(countV/totalaa*100) + "%"
print "the relative E composition is " + str(countE/totalaa*100) + "%"
print "the relative Y composition is " + str(countY/totalaa*100) + "%"
print "there is/are " + str(countSTOP) + " STOP codons"
bioinformatics python • 3.1k views
modified 4.8 years ago by Felix_Sim250 • written 4.8 years ago by michaelrw1010

I'd begin by abstracting the counting to a new method. That way, the code is written/modified once, but called as many times as required. Pass the amino acid as a parameter to this method, and use it to output count. Also, you could create a string of all unique amino acids and iterate through it with a loop - that would make the code easier to maintain as well.

EDIT: I'd like to add that this is beautiful code. I'd never imagine that this was written by a beginner - at least documentation and variable naming wise. Logic-wise, we always learn something new. Lambdas might be of use at multiple places here.

I appreciate the reply, and thank you for the compliment. I may have under-emphasized my experience with 'programming', but I didn't feel like getting into the details because I feel my experience is negligible, by some standards. This is, indeed, my first formal training. Previously I have worked my way through writing some basic scripts in a language called Papyrus that is used for the video game Skyrim. I also have tweaked existing HTML and javascript widgets that I use on my smartphone.. never wrote one from scratch, though. These things were done only as a hobby, but it did give me a good introduction to some of the fundamentals. It also taught me that, in order to better understand the flow of a particular code, i like to use variable names that make sense to me, and also try to keep everything very organized.

At any rate, I'm not sure I know how exactly I could accomplish 'abstracting the counting to a new method'. I do understand the purpose of it though, which is in line with my original motivation for writing this post (trying to simplify the code; remove redundant/unnecessary/repetitive code).

With respect to creating a unique aa string (a 20-character string that represents each of the 20 aa).. how could one then use that to compare other aa strings?

I appreciate the help, thank you

-MW

You're very welcome - I should have used the word "extract", not "abstract" there - when you face a repetitive task that involves a bunch of code executing multiple times with uniform, predictable parts changing in each execution, you can create a generic task (as a method) and pass in the item that changes as a parameter. As Felix has shown below, you can use loops, or you can pass in the AAs as parameters and create a method that defines the task just once - this is typically useful for longer, more complicated tasks.

This makes code more maintainable. You end up changing fewer lines of code per change in logic/requirement and the change remains consistent because partial changes (such as changes that were made on just 19 of the 20 lines by mistake) would be invalid and would not work.

Modifying Felix's code, assuming you did not need to store the AA counts, you could just do this:

for aa in singleletter_dict.values():
print "the relative {letter} composition is {composition}%".format(letter=aa,              composition=float(singleletterstring.count(aa)/len(singleletterstring)*100)

Or if you wanted cleaner code, you could still compute the counts, the overall count and the composition separately, and use them in the same code block, as shown below:

count_all=len(singleletterstring)
for aa in singleletter_dict.values():
count_aa=singleletterstring.count(aa)
composition=float(count_aa)/count_all*100
print "the relative {letter} composition is {composition}%".format(letter=aa, composition=composition)
1
4.8 years ago by
Felix_Sim250
United Kingdom
Felix_Sim250 wrote:

I would substitute all the individual amino acid processing lines with loops or functions (as suggested by Ram).

countC = singleletterstring.count('C')
countD = singleletterstring.count('D')
countS = singleletterstring.count('S')
countQ = singleletterstring.count('Q')
countK = singleletterstring.count('K')
countW = singleletterstring.count('W')
countT = singleletterstring.count('T')
countN = singleletterstring.count('N')
countP = singleletterstring.count('P')
countF = singleletterstring.count('F')
countA = singleletterstring.count('A')
countG = singleletterstring.count('G')
countI = singleletterstring.count('I')
countL = singleletterstring.count('L')
countH = singleletterstring.count('H')
countR = singleletterstring.count('R')
countM = singleletterstring.count('M')
countV = singleletterstring.count('V')
countE = singleletterstring.count('E')
countY = singleletterstring.count('Y')

countSTOP = singleletterstring.count('*')

totalaa = float(len(singleletterstring))

This can be replaced by something like the following.

counts = {}
for aa in singleletter_dict.values():
counts[aa] = singleletterstring.count(aa)
count_all = len(singleletterstring)

The second part are all the print statements. Generally, this is not a problem but a nightmare to maintain. Imagine you want to change this statement in the future and you have to edit each line individually.

print "the relative C composition is " + str(countC/totalaa*100) + "%"
print "the relative D composition is " + str(countD/totalaa*100) + "%"
print "the relative S composition is " + str(countS/totalaa*100) + "%"
print "the relative Q composition is " + str(countQ/totalaa*100) + "%"
print "the relative K composition is " + str(countK/totalaa*100) + "%"
print "the relative W composition is " + str(countW/totalaa*100) + "%"
print "the relative T composition is " + str(countT/totalaa*100) + "%"
print "the relative N composition is " + str(countN/totalaa*100) + "%"
print "the relative P composition is " + str(countP/totalaa*100) + "%"
print "the relative F composition is " + str(countF/totalaa*100) + "%"
print "the relative A composition is " + str(countA/totalaa*100) + "%"
print "the relative G composition is " + str(countG/totalaa*100) + "%"
print "the relative I composition is " + str(countI/totalaa*100) + "%"
print "the relative L composition is " + str(countL/totalaa*100) + "%"
print "the relative H composition is " + str(countH/totalaa*100) + "%"
print "the relative R composition is " + str(countR/totalaa*100) + "%"
print "the relative M composition is " + str(countM/totalaa*100) + "%"
print "the relative V composition is " + str(countV/totalaa*100) + "%"
print "the relative E composition is " + str(countE/totalaa*100) + "%"
print "the relative Y composition is " + str(countY/totalaa*100) + "%"
print "there is/are " + str(countSTOP) + " STOP codons"

You can replace this whole block with the following:

for aa, count in sorted(counts.iteritems()):
if aa=="*":
print "there is/are {count} STOP codons".format(count=count)
else:
composition = float(count)/count_all*100
print "the relative {letter} composition is {composition}%".format(letter=aa,
composition=composition)

In both cases you replace repetitive blocks of code with a loop structure. This allows you to make changes easily in the future by simply editing a single line. If you do ever require some complexity, you could always implement if-else statements.

1

One note on efficiency. If you count amino acids this way:

counts = {}
for aa in singleletter_dict.values():
counts[aa] = singleletterstring.count(aa)

you will actually iterate through singleletterstring 21 times (each time count method is called it goes through the entire string). But the same result can be achieved iterating through it only once. You can write your own function for this:

counts = {}
for aa in singleletter_dict.values():
counts[aa] = 0

for aa in singleletterstring:
counts[aa] += 1

Even better way is to utilize a function from python standard library if you're already familiar with imports. For example:

from collections import Counter
polypeptide = 'VGSDCTTIHYNYMCNSSCMGGMNRRPILTITTLEDSSGNLLGRNSFEVRVCACPGRDRRT'
count = Counter(polypeptide)

You will receive a result like this:

Counter({'A': 1,
'C': 5,
'D': 3,
'E': 2,
'F': 1,
'G': 6,
'H': 1,
'I': 3,
'L': 4,
'M': 3,
'N': 5,
'P': 2,
'R': 7,
'S': 6,
'T': 6,
'V': 3,
'Y': 2})

Thank you for this. It's a great way to approach the counting problem, and I've now added it to my list of problem solving techniques :)

I cannot thank you enough for taking the time to read through and make sense of everything, then even use my own variable names in your response.

This is precisely what I was looking for. Now the trick is going to be breaking it down and understanding how you implemented it.. the 'why', if you will. We just had the lecture on for loops last week.. some of them are pretty straightforward, but I am still trying to learn all the various ways you can manipulate them.

For example, when you put "for aa, count in sorted(counts.iteritems()):"

I have experience with using the "for i in range(a, b, c)" which i understand will do whatever I want (iterate?) across the range. I struggle to comprehend what is going on when "range(a, b, c)" is replaced with something else, or when the 'i' is replaced with anything you want, or in your code there are two 'things'/'variables' where 'i' usually goes.. my grasp on these things is a bit shaky. Yeah, i know that if I want to do something over and over that i can essentially write something with syntax that somewhat resembles my original code, but I dont fully understand why i write it that way, and/or what each piece does.

take this for example (taken from one of our lecture slides)

for item1, item2 in dictA.keys():
do something with keys and values

what do we call item1 and item2? what are we telling the program when we list 2 items in that part of syntax?

what sort of freedom do we have with the dictA.keys() component? we have seen that one can put a defined variable there, it can be a python function such as range, etc.

If I had a better understanding of all of this, I probably wouldnt feel as helpless.

I digress. I appreciate your response and I am going to take this code and play around with it in my interpreter and see if I can't break it down, line by line to see exactly what is going on 'behind the scenes' so to speak. thx

When you use a loop with a dictionary, you can have the iterator yield both the key and the value for each item in the dictionary, and store each in a variable. That's what the for aa, count in sorted(counts.iteritems()) does - it stores the amino acid (the key) and the counts (the corresponding value) in 2 variables - aa and count, respectively. It also sorts the sorts the dictionary by key, so the amino acids are output in alphabetical order.

Also, the keys() in your code would only give you the keys, so storing them in 2 variables (like you have shown above) may not work as expected.