Hello,
I am having a lot of trouble with a program that I am writing and I cannot find my error. The program is two fold, first it is supposed to take a gbk file and parse all of the coding regions (CDS regions). Then it should take the total coding regions com pined and count the numbed of amino acid codons. The program runs fine but the output is wrong, I have checked this with several reputable sources. Please help, I have included my code, my output and what the correct output should look like.
Thank you
Code:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
Sequencetemp = Seq("", generic_dna)
sequencefull = Seq("", generic_dna)
concatenated = Seq("", generic_dna)
key=0
value=0
i=0
def check(Sequencetemp):
if Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAA" or Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAG" or Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TGA":
return 1
for seq_record in SeqIO.parse("NC_000913.gbk", "genbank"):
sequencefull=seq_record.seq
for seq_feature in seq_record.features:
if seq_feature.type=='CDS':
if "locus_tag" in seq_feature.qualifiers and "product" in seq_feature.qualifiers:
start = seq_feature.location.start
end = seq_feature.location.end
if seq_feature.strand == -1:
Sequencetemp = sequencefull[start:end:-1]
elif seq_feature.strand == 1:
Sequencetemp = sequencefull[start:end:1]
concatenated = concatenated + Sequencetemp
if len(concatenated)%3!=0:
print("Error: CDS regions not read correctly")
print('\n\n aa codon total\n')
codons = {"Ala/A" : ["GCC","GCT","GCA"],"Arg/R" : ["CGT","CGC","CGA","CGG","AGA","AGG"],
"Asn/N" : ["AAT","AAC"], "Asp/D" : ["GAT","GAC"],"Cys/C" : ["TGT","TGC"],
"Gln/Q" : ["CAA","CAG"],"Glu/E" : ["GAA","GAG"], "Gly/G" : ["GGT","GGC","GGA","GGG"],
"His/H" : ["CAT","CAC"], "Ile/I" : ["ATT","ATC","ATA"],"Met/M" : ["ATG"],
"Leu/L" : ["TTA","TTG","CTT","CTC","CTA","CTG"],"Lys/K" : ["AAA","AAG"],
"Phe/F" : ["TTT","TTC"],"Pro/P" : ["CCT","CCC","CCA","CCG"],
"Ser/S" : ["TCT","TCC","TCA","TCG", "AGT", "AGC"], "Thr/T" : ["ACT","ACC","ACA","ACG"],
"Trp/W" : ["TGG"], "Tyr/Y" : ["TAT","TAC"], "Val/V" : ["GTT","GTC","GTA","GTG"],
"STOP " : ["TAA","TGA","TAG"],}
for key, value in codons.items():
for I in value:
print('%s %s %d'%(key, i, concatenated.count(i)))
MY OUTPUT:
aa codon total
His/H CAT 25386
His/H CAC 22411
Ile/I ATT 35128
Ile/I ATC 32697
Ile/I ATA 17829
Met/M ATG 37031
Ala/A GCC 35547
Ala/A GCT 38110
Ala/A GCA 39735
Ser/S TCT 22739
Ser/S TCC 19936
Ser/S TCA 27385
Ser/S TCG 28920
Ser/S AGT 18238
Ser/S AGC 31220
Lys/K AAA 34013
Lys/K AAG 29834
Tyr/Y TAT 28033
Tyr/Y TAC 22115
Gly/G GGT 34142
Gly/G GGC 44695
Gly/G GGA 25934
Gly/G GGG 18308
Gln/Q CAA 30002
Gln/Q CAG 36056
Asp/D GAT 39449
Asp/D GAC 22989
Arg/R CGT 32599
Arg/R CGC 44869
Arg/R CGA 32570
Arg/R CGG 36486
Arg/R AGA 19984
Arg/R AGG 19233
Trp/W TGG 49052
Val/V GTT 32543
Val/V GTC 22679
Val/V GTA 20654
Val/V GTG 31673
Asn/N AAT 29272
Asn/N AAC 34935
Leu/L TTA 28841
Leu/L TTG 32196
Leu/L CTT 20728
Leu/L CTC 16586
Leu/L CTA 12922
Leu/L CTG 53563
Pro/P CCT 20670
Pro/P CCC 13967
Pro/P CCA 25441
Pro/P CCG 38828
Cys/C TGT 23024
Cys/C TGC 41620
Phe/F TTT 26283
Phe/F TTC 27362
STOP TAA 22058
STOP TGA 41221
STOP TAG 7496
Thr/T ACT 21936
Thr/T ACC 29456
Thr/T ACA 21430
Thr/T ACG 29764
Glu/E GAA 41982
Glu/E GAG 15868
Correct Output Should Be:
aa codon total percent
Ala/A GCC 33872 %27.0
Ala/A GCT 20006 %16.0
Ala/A GCA 26537 %21.2
Ala/A GCG 44899 %35.8
Arg/R CGT 27834 %38.4
Arg/R CGG 6996 %9.6
Arg/R CGC 29291 %40.4
Arg/R AGA 2501 %3.4
Arg/R CGA 4542 %6.3
Arg/R AGG 1365 %1.9
Asn/N AAT 22752 %44.6
Asn/N AAC 28309 %55.4
Asp/D GAT 42121 %62.6
Asp/D GAC 25198 %37.4
Cys/C TGC 8484 %55.8
Cys/C TGT 6712 %44.2
Gln/Q CAA 20207 %34.6
Gln/Q CAG 38131 %65.4
Glu/E GAA 52314 %69.1
Glu/E GAG 23432 %30.9
Gly/G GGA 10232 %10.6
Gly/G GGC 39360 %40.7
Gly/G GGT 32650 %33.8
Gly/G GGG 14465 %15.0
His/H CAT 16940 %56.9
His/H CAC 12809 %43.1
Ile/I ATA 5345 %6.8
Ile/I ATT 40166 %51.0
Ile/I ATC 33315 %42.3
Leu/L TTA 18090 %12.9
Leu/L CTA 5075 %3.6
Leu/L CTC 14702 %10.5
Leu/L CTG 70360 %50.0
Leu/L CTT 14399 %10.2
Leu/L TTG 17990 %12.8
Lys/K AAA 44219 %76.8
Lys/K AAG 13369 %23.2
Met/M ATG 36683 %100.0
Phe/F TTC 21822 %42.7
Phe/F TTT 29300 %57.3
Pro/P CCT 9135 %15.7
Pro/P CCG 30966 %53.1
Pro/P CCC 7136 %12.2
Pro/P CCA 11063 %19.0
STOP TGA 1194 %28.7
STOP TAG 287 %6.9
STOP TAA 2685 %64.5
Ser/S TCA 9158 %12.1
Ser/S TCG 11748 %15.5
Ser/S TCT 10984 %14.5
Ser/S AGC 21127 %27.9
Ser/S AGT 11322 %15.0
Ser/S TCC 11311 %15.0
Thr/T ACT 11580 %16.4
Thr/T ACA 8979 %12.7
Thr/T ACG 18967 %26.9
Thr/T ACC 30949 %43.9
Trp/W TGG 20055 %100.0
Tyr/Y TAT 21057 %56.7
Tyr/Y TAC 16076 %43.3
Val/V GTG 34786 %37.3
Val/V GTC 20224 %21.7
Val/V GTA 14323 %15.3
Val/V GTT 24029 %25.7
Thank you for the reply it was very helpful.
1) I believe you are right that the frame is not correctly maintained, I believe the problem is not all of the start/stop are correct. What I mean by this is that sequence_temp%3 is not always true. Do you have any idea why this would be?
2) and 3) I wanted to get the counts right before I worried about formatting and calculating the percentages.
I took your advice and re wrote the code in an attempt to just get the correct counts and I am getting an error where on of the codons only contains two letters. Specifically the error message is
My new code:
I'm not entirely sure what's causing the bug. People will include partial CDSs in their annotation, but in my experience, annotations are not very well standardized and the chances of running into malformed data can be high. If a CDS is partial, you may have to pay special attention to the /codon_start annotation, because it will tell you to skip one or two bases. Depending on how important it is to you to include every CDS possible, you could attempt to prevent some of these errors beforehand, for example by immediately checking if the
CDS length %3 == 0
and only entering the loop if it is. Some other suggestions that may or may not help you:1) Rather than use your if/else statements to try and extract the sequences, Biopython already has some functions to accomplish the same thing. The extract method of a
SeqFeature
object will extract the sequence for you, and then you can use the.reverse_complement()
method if the CDS is on the minus strand. You can therefore avoid trying to extract the sequence yourself using list slicing (which, by the way, doesn't take care of the complement part of the negative strand, just the reverse part).2) I've never tried to use the feature location/start/end to extract sequences, so this may or may not be true, but I would be careful about "human-readable" indexing vs computer indexing. If you have a sequence, ATGCCCGGGTAA, the Genbank feature would be from 1..12, but the string in Python is indexed 0..11. When you save the feature start/end, if the values of those are 1/12 instead of 0/11, when you plug those numbers into the Python string slicing (e.g. sequencefull[start:end]), you might end up with the wrong data ("TGCCCGGGTAA" instead of the whole string). Also keep in mind that list slicing in Python is exclusive at the endpoint. I suspect that this might be causing your unexpected partial codon, because the program isn't slicing in the "human" way-- it might be reading TGC/CCG/GGT/AA. The other advantage of using the extract method mentioned above is that it (according to the docs) will handle exon/intron splicing, which may not be a problem if you're only working with bacteria but would be a huge problem in eukaryotic data sets.