Question

Reading CDS regions from a gbk file and counting amino acid codons in Biopython

0

Entering edit mode

8.8 years ago

cameronmartino • 0

Hello,

I am having a lot of trouble with a program that I am writing and I cannot find my error. The program is two fold, first it is supposed to take a gbk file and parse all of the coding regions (CDS regions). Then it should take the total coding regions com pined and count the numbed of amino acid codons. The program runs fine but the output is wrong, I have checked this with several reputable sources. Please help, I have included my code, my output and what the correct output should look like.

Thank you

Code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

Sequencetemp = Seq("", generic_dna)
sequencefull = Seq("", generic_dna)
concatenated = Seq("", generic_dna)
key=0
value=0
i=0

def check(Sequencetemp):
    if Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAA" or Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAG" or  Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TGA": 
        return 1 

for seq_record in SeqIO.parse("NC_000913.gbk", "genbank"):
    sequencefull=seq_record.seq
    for seq_feature in seq_record.features:
        if seq_feature.type=='CDS':
            if "locus_tag" in seq_feature.qualifiers and "product" in seq_feature.qualifiers:
                start = seq_feature.location.start
                end = seq_feature.location.end
                if seq_feature.strand == -1:
                    Sequencetemp = sequencefull[start:end:-1]
                elif seq_feature.strand == 1:
                    Sequencetemp = sequencefull[start:end:1]
                concatenated = concatenated + Sequencetemp

if len(concatenated)%3!=0:
    print("Error: CDS regions not read correctly")

print('\n\n aa   codon  total\n')

codons = {"Ala/A" : ["GCC","GCT","GCA"],"Arg/R" : ["CGT","CGC","CGA","CGG","AGA","AGG"],
          "Asn/N" : ["AAT","AAC"], "Asp/D" : ["GAT","GAC"],"Cys/C" : ["TGT","TGC"], 
          "Gln/Q" : ["CAA","CAG"],"Glu/E" : ["GAA","GAG"], "Gly/G" : ["GGT","GGC","GGA","GGG"],
          "His/H" : ["CAT","CAC"], "Ile/I" : ["ATT","ATC","ATA"],"Met/M" : ["ATG"], 
          "Leu/L" : ["TTA","TTG","CTT","CTC","CTA","CTG"],"Lys/K" : ["AAA","AAG"], 
          "Phe/F" : ["TTT","TTC"],"Pro/P" : ["CCT","CCC","CCA","CCG"], 
          "Ser/S" : ["TCT","TCC","TCA","TCG", "AGT", "AGC"], "Thr/T" : ["ACT","ACC","ACA","ACG"],
          "Trp/W" : ["TGG"], "Tyr/Y" : ["TAT","TAC"], "Val/V" : ["GTT","GTC","GTA","GTG"], 
          "STOP " : ["TAA","TGA","TAG"],}

for key, value in codons.items():
    for I in value:
        print('%s  %s   %d'%(key,  i, concatenated.count(i)))

MY OUTPUT:

aa   codon  total
His/H  CAT   25386
His/H  CAC   22411
Ile/I  ATT   35128
Ile/I  ATC   32697
Ile/I  ATA   17829
Met/M  ATG   37031
Ala/A  GCC   35547
Ala/A  GCT   38110
Ala/A  GCA   39735
Ser/S  TCT   22739
Ser/S  TCC   19936
Ser/S  TCA   27385
Ser/S  TCG   28920
Ser/S  AGT   18238
Ser/S  AGC   31220
Lys/K  AAA   34013
Lys/K  AAG   29834
Tyr/Y  TAT   28033
Tyr/Y  TAC   22115
Gly/G  GGT   34142
Gly/G  GGC   44695
Gly/G  GGA   25934
Gly/G  GGG   18308
Gln/Q  CAA   30002
Gln/Q  CAG   36056
Asp/D  GAT   39449
Asp/D  GAC   22989
Arg/R  CGT   32599
Arg/R  CGC   44869
Arg/R  CGA   32570
Arg/R  CGG   36486
Arg/R  AGA   19984
Arg/R  AGG   19233
Trp/W  TGG   49052
Val/V  GTT   32543
Val/V  GTC   22679
Val/V  GTA   20654
Val/V  GTG   31673
Asn/N  AAT   29272
Asn/N  AAC   34935
Leu/L  TTA   28841
Leu/L  TTG   32196
Leu/L  CTT   20728
Leu/L  CTC   16586
Leu/L  CTA   12922
Leu/L  CTG   53563
Pro/P  CCT   20670
Pro/P  CCC   13967
Pro/P  CCA   25441
Pro/P  CCG   38828
Cys/C  TGT   23024
Cys/C  TGC   41620
Phe/F  TTT   26283
Phe/F  TTC   27362
STOP   TAA   22058
STOP   TGA   41221
STOP   TAG   7496
Thr/T  ACT   21936
Thr/T  ACC   29456
Thr/T  ACA   21430
Thr/T  ACG   29764
Glu/E  GAA   41982
Glu/E  GAG   15868

Correct Output Should Be:

aa         codon  total     percent
Ala/A  GCC    33872  %27.0
Ala/A  GCT    20006  %16.0
Ala/A  GCA   26537  %21.2
Ala/A  GCG   44899  %35.8
Arg/R  CGT    27834  %38.4
Arg/R  CGG   6996    %9.6
Arg/R  CGC    29291  %40.4
Arg/R  AGA   2501    %3.4
Arg/R  CGA   4542    %6.3
Arg/R  AGG   1365    %1.9
Asn/N AAT   22752  %44.6
Asn/N AAC   28309  %55.4
Asp/D GAT   42121  %62.6
Asp/D GAC   25198  %37.4
Cys/C  TGC    8484    %55.8
Cys/C  TGT    6712    %44.2
Gln/Q  CAA   20207  %34.6
Gln/Q  CAG   38131  %65.4
Glu/E   GAA   52314  %69.1
Glu/E   GAG   23432  %30.9
Gly/G  GGA   10232  %10.6
Gly/G  GGC   39360  %40.7
Gly/G  GGT   32650  %33.8
Gly/G  GGG   14465  %15.0
His/H  CAT    16940  %56.9
His/H  CAC    12809  %43.1
Ile/I      ATA   5345    %6.8
Ile/I      ATT    40166  %51.0
Ile/I      ATC    33315  %42.3
Leu/L   TTA    18090  %12.9
Leu/L   CTA    5075    %3.6
Leu/L   CTC    14702  %10.5
Leu/L   CTG    70360  %50.0
Leu/L   CTT    14399  %10.2
Leu/L   TTG    17990  %12.8
Lys/K  AAA   44219  %76.8
Lys/K  AAG   13369  %23.2
Met/M            ATG   36683  %100.0
Phe/F   TTC    21822  %42.7
Phe/F   TTT    29300  %57.3
Pro/P   CCT    9135    %15.7
Pro/P   CCG    30966  %53.1
Pro/P   CCC    7136    %12.2
Pro/P   CCA    11063  %19.0
STOP  TGA   1194    %28.7
STOP  TAG   287      %6.9
STOP  TAA   2685    %64.5
Ser/S    TCA    9158    %12.1
Ser/S    TCG    11748  %15.5
Ser/S    TCT    10984  %14.5
Ser/S    AGC   21127  %27.9
Ser/S    AGT   11322  %15.0
Ser/S    TCC    11311  %15.0
Thr/T  ACT    11580  %16.4
Thr/T  ACA   8979    %12.7
Thr/T  ACG   18967  %26.9
Thr/T  ACC    30949  %43.9
Trp/W TGG   20055  %100.0
Tyr/Y  TAT    21057  %56.7
Tyr/Y  TAC    16076  %43.3
Val/V   GTG   34786  %37.3
Val/V   GTC    20224  %21.7
Val/V   GTA   14323  %15.3
Val/V   GTT    24029  %25.7

gbk codon amino-acid CDS • 3.9k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.8 years ago by cameronmartino • 0

Ram · Answer 1 · 2015-07-31

3

Entering edit mode

8.8 years ago

athey.johnc ▴ 60

I'm somewhat confused by your question, as I'm not sure which errors you're referring to in particular.

I believe the counts that you're seeing are different because the .count() method you're using on the concatenated string does not maintain the proper reading frame. The function is non-overlapping according to this page, but that doesn't mean it keeps the reading frame maintained.
If you take issue with the order your data is being printed in, that's because dicts in Python are inherently unordered. If you want it to preserve the order you entered it in, you need to look into using an OrderedDict instead.
I'm pretty sure you aren't seeing any percentage data because you never told it to print any. The print statement in your code only has 3 outputs: the amino acid, the codon, and the count for that codon.

What you might try instead is to initialize a dict with the codons and their counts (e.g. codon_count = {'ATG': 0, 'ATT': 0, 'ATC': 0, ... }), and then read through each separate CDS (what I think you call Sequencetemp) in chunks of [n, n+3], e.g.:

for n in range(0, len(Sequencetemp), 3):
     codon = Sequencetemp[n, n+3]
     codon_count[codon] += 1

This will give you the totals for each codon in all the CDSs in the file. Then, at the end, you can iterate through the dict you have, codons, and calculate a total for all the codons in the value of each key; then just divide the value for each codon (in codon_count) by the total for the codons encoding the same AA. (I haven't tested this code exactly, but I have a similar program that calculates the same data in a slightly different way.)

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.8 years ago by athey.johnc ▴ 60

0

Entering edit mode

Thank you for the reply it was very helpful.

1) I believe you are right that the frame is not correctly maintained, I believe the problem is not all of the start/stop are correct. What I mean by this is that sequence_temp%3 is not always true. Do you have any idea why this would be?

2) and 3) I wanted to get the counts right before I worried about formatting and calculating the percentages.

I took your advice and re wrote the code in an attempt to just get the correct counts and I am getting an error where on of the codons only contains two letters. Specifically the error message is

---> 32 codon_count[codon] += 1
     33 else:
     34    Sequencetemp = str(sequencefull[start:end:1])

KeyError: 'AA'

My new code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

codon_count = {"GCC" : 0,"GCT" : 0,"GCA" : 0,"GCG" : 0,"CGT" : 0 ,"CGC" : 0,"CGA" : 0,"CGG" : 0,"AGA" : 0,"AGG" : 0, 
               "AAT" : 0,"AAC" : 0,"GAT" : 0,"GAC" : 0, "TGT" : 0,"TGC" : 0,"CAA" : 0,"CAG" : 0,"GAA" : 0,
               "GAG" : 0,"GGT" : 0,"GGC" : 0,"GGA" : 0,"GGG" : 0,"CAT" : 0,"CAC" : 0, "ATT" : 0,"ATC" : 0,
               "ATA" : 0,"ATG" : 0,"TTA" : 0,"TTG" : 0,"CTT" : 0,"CTC" : 0,"CTA" : 0,"CTG" : 0,"AAA" : 0,
               "AAG" : 0,"TTT" : 0,"TTC" : 0,"CCT" : 0,"CCC" : 0,"CCA" : 0,"CCG" : 0, "TCT" : 0,"TCC" : 0,
               "TCA" : 0,"TCG" : 0,"AGT" : 0,"AGC" : 0,"ACT" : 0,"ACC" : 0,"ACA" : 0,"ACG" : 0,"TGG" : 0,
               "TAT" : 0,"TAC" : 0,"GTT" : 0,"GTC" : 0,"GTA" : 0,"GTG" : 0,"TAA" : 0,"TGA" : 0,"TAG" : 0,}



for seq_record in SeqIO.parse("NC_000913.gbk", "genbank"):
    sequencefull=seq_record.seq
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS":
            location=seq_feature.location
            start=seq_feature.location.start
            end = seq_feature.location.end
            if seq_feature.strand == -1:
                Sequencetemp = str(sequencefull[start:end:-1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1
            elif seq_feature.strand == 1:
                Sequencetemp = str(sequencefull[start:end:1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1
            else:
                Sequencetemp = str(sequencefull[start:end:1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1

for key, value in codon_count.items() :
    print (key, value)

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by cameronmartino • 0

0

Entering edit mode

I'm not entirely sure what's causing the bug. People will include partial CDSs in their annotation, but in my experience, annotations are not very well standardized and the chances of running into malformed data can be high. If a CDS is partial, you may have to pay special attention to the /codon_start annotation, because it will tell you to skip one or two bases. Depending on how important it is to you to include every CDS possible, you could attempt to prevent some of these errors beforehand, for example by immediately checking if the CDS length %3 == 0 and only entering the loop if it is. Some other suggestions that may or may not help you:

1) Rather than use your if/else statements to try and extract the sequences, Biopython already has some functions to accomplish the same thing. The extract method of a SeqFeature object will extract the sequence for you, and then you can use the .reverse_complement() method if the CDS is on the minus strand. You can therefore avoid trying to extract the sequence yourself using list slicing (which, by the way, doesn't take care of the complement part of the negative strand, just the reverse part).

2) I've never tried to use the feature location/start/end to extract sequences, so this may or may not be true, but I would be careful about "human-readable" indexing vs computer indexing. If you have a sequence, ATGCCCGGGTAA, the Genbank feature would be from 1..12, but the string in Python is indexed 0..11. When you save the feature start/end, if the values of those are 1/12 instead of 0/11, when you plug those numbers into the Python string slicing (e.g. sequencefull[start:end]), you might end up with the wrong data ("TGCCCGGGTAA" instead of the whole string). Also keep in mind that list slicing in Python is exclusive at the endpoint. I suspect that this might be causing your unexpected partial codon, because the program isn't slicing in the "human" way-- it might be reading TGC/CCG/GGT/AA. The other advantage of using the extract method mentioned above is that it (according to the docs) will handle exon/intron splicing, which may not be a problem if you're only working with bacteria but would be a huge problem in eukaryotic data sets.

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.8 years ago by athey.johnc ▴ 60