Reading CDS regions from a gbk file and counting amino acid codons in Biopython
1
0
Entering edit mode
6.2 years ago

Hello,

I am having a lot of trouble with a program that I am writing and I cannot find my error. The program is two fold, first it is supposed to take a gbk file and parse all of the coding regions (CDS regions). Then it should take the total coding regions com pined and count the numbed of amino acid codons. The program runs fine but the output is wrong, I have checked this with several reputable sources. Please help, I have included my code, my output and what the correct output should look like. 

thank you, 

 

gbk CDS aminoacid codon • 2.3k views
ADD COMMENT
0
Entering edit mode

Code:

from Bio import SeqIO

from Bio.Seq import Seq
from Bio.Alphabet import generic_dna


Sequencetemp = Seq("", generic_dna)
sequencefull = Seq("", generic_dna)
concatenated = Seq("", generic_dna)
key=0
value=0
i=0


def check(Sequencetemp):
    
    if Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAA" or Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TAG" or  Sequencetemp[len(Sequencetemp)-3:len(Sequencetemp)] is "TGA": 
        return 1 
    

for seq_record in SeqIO.parse("NC_000913.gbk", "genbank"):
    sequencefull=seq_record.seq
    for seq_feature in seq_record.features:
        if seq_feature.type=='CDS':
            if "locus_tag" in seq_feature.qualifiers and "product" in seq_feature.qualifiers:
                start = seq_feature.location.start
                end = seq_feature.location.end
                if seq_feature.strand == -1:
                    Sequencetemp = sequencefull[start:end:-1]
                elif seq_feature.strand == 1:
                    Sequencetemp = sequencefull[start:end:1]   
                concatenated = concatenated + Sequencetemp
       
    
    
if len(concatenated)%3!=0:
    print("Error: CDS regions not read correctly")

print('\n\n aa   codon  total\n')

codons = {"Ala/A" : ["GCC","GCT","GCA"],"Arg/R" : ["CGT","CGC","CGA","CGG","AGA","AGG"],
          "Asn/N" : ["AAT","AAC"], "Asp/D" : ["GAT","GAC"],"Cys/C" : ["TGT","TGC"], 
          "Gln/Q" : ["CAA","CAG"],"Glu/E" : ["GAA","GAG"], "Gly/G" : ["GGT","GGC","GGA","GGG"],
          "His/H" : ["CAT","CAC"], "Ile/I" : ["ATT","ATC","ATA"],"Met/M" : ["ATG"], 
          "Leu/L" : ["TTA","TTG","CTT","CTC","CTA","CTG"],"Lys/K" : ["AAA","AAG"], 
          "Phe/F" : ["TTT","TTC"],"Pro/P" : ["CCT","CCC","CCA","CCG"], 
          "Ser/S" : ["TCT","TCC","TCA","TCG", "AGT", "AGC"], "Thr/T" : ["ACT","ACC","ACA","ACG"],
          "Trp/W" : ["TGG"], "Tyr/Y" : ["TAT","TAC"], "Val/V" : ["GTT","GTC","GTA","GTG"], 
          "STOP " : ["TAA","TGA","TAG"],}

for key, value in codons.items():
    for i in value:
        print('%s  %s   %d'%(key,  i, concatenated.count(i)))

 

 

MY OUTPUT:

 

aa   codon  total

 

His/H  CAT   25386

His/H  CAC   22411

Ile/I  ATT   35128

Ile/I  ATC   32697

Ile/I  ATA   17829

Met/M  ATG   37031

Ala/A  GCC   35547

Ala/A  GCT   38110

Ala/A  GCA   39735

Ser/S  TCT   22739

Ser/S  TCC   19936

Ser/S  TCA   27385

Ser/S  TCG   28920

Ser/S  AGT   18238

Ser/S  AGC   31220

Lys/K  AAA   34013

Lys/K  AAG   29834

Tyr/Y  TAT   28033

Tyr/Y  TAC   22115

Gly/G  GGT   34142

Gly/G  GGC   44695

Gly/G  GGA   25934

Gly/G  GGG   18308

Gln/Q  CAA   30002

Gln/Q  CAG   36056

Asp/D  GAT   39449

Asp/D  GAC   22989

Arg/R  CGT   32599

Arg/R  CGC   44869

Arg/R  CGA   32570

Arg/R  CGG   36486

Arg/R  AGA   19984

Arg/R  AGG   19233

Trp/W  TGG   49052

Val/V  GTT   32543

Val/V  GTC   22679

Val/V  GTA   20654

Val/V  GTG   31673

Asn/N  AAT   29272

Asn/N  AAC   34935

Leu/L  TTA   28841

Leu/L  TTG   32196

Leu/L  CTT   20728

Leu/L  CTC   16586

Leu/L  CTA   12922

Leu/L  CTG   53563

Pro/P  CCT   20670

Pro/P  CCC   13967

Pro/P  CCA   25441

Pro/P  CCG   38828

Cys/C  TGT   23024

Cys/C  TGC   41620

Phe/F  TTT   26283

Phe/F  TTC   27362

STOP   TAA   22058

STOP   TGA   41221

STOP   TAG   7496

Thr/T  ACT   21936

Thr/T  ACC   29456

Thr/T  ACA   21430

Thr/T  ACG   29764

Glu/E  GAA   41982

Glu/E  GAG   15868

 

 

 

 

Correct Output Should Be:

 

 

aa         codon  total     percent

Ala/A  GCC    33872  %27.0

Ala/A  GCT    20006  %16.0

Ala/A  GCA   26537  %21.2

Ala/A  GCG   44899  %35.8

Arg/R  CGT    27834  %38.4

Arg/R  CGG   6996    %9.6

Arg/R  CGC    29291  %40.4

Arg/R  AGA   2501    %3.4

Arg/R  CGA   4542    %6.3

Arg/R  AGG   1365    %1.9

Asn/N AAT   22752  %44.6

Asn/N AAC   28309  %55.4

Asp/D GAT   42121  %62.6

Asp/D GAC   25198  %37.4

Cys/C  TGC    8484    %55.8

Cys/C  TGT    6712    %44.2

Gln/Q  CAA   20207  %34.6

Gln/Q  CAG   38131  %65.4

Glu/E   GAA   52314  %69.1

Glu/E   GAG   23432  %30.9

Gly/G  GGA   10232  %10.6

Gly/G  GGC   39360  %40.7

Gly/G  GGT   32650  %33.8

Gly/G  GGG   14465  %15.0

His/H  CAT    16940  %56.9

His/H  CAC    12809  %43.1

Ile/I      ATA   5345    %6.8

Ile/I      ATT    40166  %51.0

Ile/I      ATC    33315  %42.3

Leu/L   TTA    18090  %12.9

Leu/L   CTA    5075    %3.6

Leu/L   CTC    14702  %10.5

Leu/L   CTG    70360  %50.0

Leu/L   CTT    14399  %10.2

Leu/L   TTG    17990  %12.8

Lys/K  AAA   44219  %76.8

Lys/K  AAG   13369  %23.2

Met/M            ATG   36683  %100.0

Phe/F   TTC    21822  %42.7

Phe/F   TTT    29300  %57.3

Pro/P   CCT    9135    %15.7

Pro/P   CCG    30966  %53.1

Pro/P   CCC    7136    %12.2

Pro/P   CCA    11063  %19.0

STOP  TGA   1194    %28.7

STOP  TAG   287      %6.9

STOP  TAA   2685    %64.5

Ser/S    TCA    9158    %12.1

Ser/S    TCG    11748  %15.5

Ser/S    TCT    10984  %14.5

Ser/S    AGC   21127  %27.9

Ser/S    AGT   11322  %15.0

Ser/S    TCC    11311  %15.0

Thr/T  ACT    11580  %16.4

Thr/T  ACA   8979    %12.7

Thr/T  ACG   18967  %26.9

Thr/T  ACC    30949  %43.9

Trp/W TGG   20055  %100.0

Tyr/Y  TAT    21057  %56.7

Tyr/Y  TAC    16076  %43.3

Val/V   GTG   34786  %37.3

Val/V   GTC    20224  %21.7

Val/V   GTA   14323  %15.3

Val/V   GTT    24029  %25.7

 

ADD REPLY
1
Entering edit mode
6.2 years ago
athey.johnc ▴ 40

I'm somewhat confused by your question, as I'm not sure which errors you're referring to in particular.

1) I believe the counts that you're seeing are different because the .count() method you're using on the concatenated string does not maintain the proper reading frame. The function is non-overlapping according to this page, but that doesn't mean it keeps the reading frame maintained.

2) If you take issue with the order your data is being printed in, that's because dicts in Python are inherently unordered. If you want it to preserve the order you entered it in, you need to look into using an OrderedDict instead.

3) I'm pretty sure you aren't seeing any percentage data because you never told it to print any. The print statement in your code only has 3 outputs: the amino acid, the codon, and the count for that codon.

What you might try instead is to initialize a dict with the codons and their counts (e.g. codon_count = {'ATG': 0, 'ATT': 0, 'ATC': 0, ... }), and then read through each separate CDS (what I think you call Sequencetemp) in chunks of [n, n+3], e.g.:

for n in range(0, len(Sequencetemp), 3):

     codon = Sequencetemp[n, n+3]

     codon_count[codon] += 1

This will give you the totals for each codon in all the CDSs in the file. Then, at the end, you can iterate through the dict you have, codons, and calculate a total for all the codons in the value of each key; then just divide the value for each codon (in codon_count) by the total for the codons encoding the same AA. (I haven't tested this code exactly, but I have a similar program that calculates the same data in a slightly different way.)

ADD COMMENT
0
Entering edit mode

Thank you for the reply it was very helpful.

1) I believe you are right that the frame is not correctly maintained, I believe the problem is not all of the start/stop are correct. What I mean by this is that sequence_temp%3 is not always true. Do you have any idea why this would be?

2) and 3) I wanted to get the counts right before I worried about formatting and calculating the percentages.

I took your advice and re wrote the code in an attempt to just get the correct counts and I am getting an error where on of the codons only contains two letters. Specifically the error message is

---> 32 codon_count[codon] += 1
     33 else:
     34    Sequencetemp = str(sequencefull[start:end:1])

KeyError: 'AA'

My new code:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

codon_count = {"GCC" : 0,"GCT" : 0,"GCA" : 0,"GCG" : 0,"CGT" : 0 ,"CGC" : 0,"CGA" : 0,"CGG" : 0,"AGA" : 0,"AGG" : 0, 
               "AAT" : 0,"AAC" : 0,"GAT" : 0,"GAC" : 0, "TGT" : 0,"TGC" : 0,"CAA" : 0,"CAG" : 0,"GAA" : 0,
               "GAG" : 0,"GGT" : 0,"GGC" : 0,"GGA" : 0,"GGG" : 0,"CAT" : 0,"CAC" : 0, "ATT" : 0,"ATC" : 0,
               "ATA" : 0,"ATG" : 0,"TTA" : 0,"TTG" : 0,"CTT" : 0,"CTC" : 0,"CTA" : 0,"CTG" : 0,"AAA" : 0,
               "AAG" : 0,"TTT" : 0,"TTC" : 0,"CCT" : 0,"CCC" : 0,"CCA" : 0,"CCG" : 0, "TCT" : 0,"TCC" : 0,
               "TCA" : 0,"TCG" : 0,"AGT" : 0,"AGC" : 0,"ACT" : 0,"ACC" : 0,"ACA" : 0,"ACG" : 0,"TGG" : 0,
               "TAT" : 0,"TAC" : 0,"GTT" : 0,"GTC" : 0,"GTA" : 0,"GTG" : 0,"TAA" : 0,"TGA" : 0,"TAG" : 0,}



for seq_record in SeqIO.parse("NC_000913.gbk", "genbank"):
    sequencefull=seq_record.seq
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS":
            location=seq_feature.location
            start=seq_feature.location.start
            end = seq_feature.location.end
            if seq_feature.strand == -1:
                Sequencetemp = str(sequencefull[start:end:-1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1
            elif seq_feature.strand == 1:
                Sequencetemp = str(sequencefull[start:end:1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1
            else:
                Sequencetemp = str(sequencefull[start:end:1])
                for n in range(0, len(Sequencetemp), 3):
                    codon = Sequencetemp[n:n+3]
                    codon_count[codon] += 1

for key, value in codon_count.items() :
    print (key, value)
ADD REPLY
0
Entering edit mode

I'm not entirely sure what's causing the bug. People will include partial CDSs in their annotation, but in my experience, annotations are not very well standardized and the chances of running into malformed data can be high. If a CDS is partial, you may have to pay special attention to the /codon_start annotation, because it will tell you to skip one or two bases. Depending on how important it is to you to include every CDS possible, you could attempt to prevent some of these errors beforehand, for example by immediately checking if the CDS length %3 == 0 and only entering the loop if it is. Some other suggestions that may or may not help you:

1) Rather than use your if/else statements to try and extract the sequences, Biopython already has some functions to accomplish the same thing. The extract method of a SeqFeature object will extract the sequence for you, and then you can use the .reverse_complement() method if the CDS is on the minus strand. You can therefore avoid trying to extract the sequence yourself using list slicing (which, by the way, doesn't take care of the complement part of the negative strand, just the reverse part).

2) I've never tried to use the feature location/start/end to extract sequences, so this may or may not be true, but I would be careful about "human-readable" indexing vs computer indexing. If you have a sequence, ATGCCCGGGTAA, the Genbank feature would be from 1..12, but the string in Python is indexed 0..11. When you save the feature start/end, if the values of those are 1/12 instead of 0/11, when you plug those numbers into the Python string slicing (e.g. sequencefull[start:end]), you might end up with the wrong data ("TGCCCGGGTAA" instead of the whole string). Also keep in mind that list slicing in Python is exclusive at the endpoint. I suspect that this might be causing your unexpected partial codon, because the program isn't slicing in the "human" way-- it might be reading TGC/CCG/GGT/AA. The other advantage of using the extract method mentioned above is that it (according to the docs) will handle exon/intron splicing, which may not be a problem if you're only working with bacteria but would be a huge problem in eukaryotic data sets.

ADD REPLY

Login before adding your answer.

Traffic: 2239 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6