Question: Perl Script - For 3Rd Codon G Or C ?
gravatar for Ads-Osu
10.4 years ago by
Ads-Osu60 wrote:

So I am trying to write a perl script that gives me the number of Gs or Cs in the third codon position. I have started using a bioperl package but cannot get it to output the name of the sequences in the FASTA.

Also, I have been unsuccessful in extracting the third position thus far. I have got to the point of separating out three base pairs at a time. Here is my code:

use Bio::SeqIO;
$seqio_obj = Bio::SeqIO->new( -file => "./all_ORFS.fasta" , '-format' => 'Fasta');

 while ($seq_obj = $seqio_obj->next_seq){
    #print the sequence
    $seq1 = $seq_obj->seq,"\n";
    $gc = 0;
    foreach ($seq1){
        @chunk = unpack("A3" x (length($seq1)/3), $seq1);
        while ($chunk){
                if($1 = "g"){
                    $gc + 1;
            print $gc;
        #print join("\n",@chunk)."\n";    
ADD COMMENTlink modified 2.2 years ago by _r_am31k • written 10.4 years ago by Ads-Osu60

Well, for starters... $1 = "g" is not the same as $1 == 'g'. I'm pretty sure that $gc + 1 is not the right syntax either. And if $chunk { isn't going to work either, because I'm pretty sure you haven't initialized $chunk. If you started your script with use strict that would be caught.

ADD REPLYlink modified 15 months ago by _r_am31k • written 2.2 years ago by swbarnes29.2k
gravatar for Neilfws
10.4 years ago by
Sydney, Australia
Neilfws49k wrote:

Good to see that you are using Bioperl for this task. Rule 1: don't reinvent the wheel. Here is one solution. It assumes that your input sequence length is a multiple of 3 and contains only codons, beginning with the start codon.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $fasta = Bio::SeqIO->new(-file => "orf1.fa", -format => "fasta");

while(my $seq = $fasta->next_seq) {
  # should really check here that $seq is as expected
  my @output = $seq->id;
  my %codons;
  for(my $i = 3; $i <= $seq->length; $i+=3) {
    my $codon3 = $seq->subseq($i, $i);
       $codons{$codon3} += 1;
  foreach my $base(sort keys %codons) {
    push(@output, $codons{$base});
  print join("\t", @output), "\n";

First, we read in a fasta file (orf1.fa) and loop through each sequence, using the Bio::SeqIO next_seq() method. We get the ID of the sequence (that's the part directly following the ">" in the fasta file) using $seq->id and store it as the first element of array @output.

Next, we loop over the sequence 3 bases at a time, starting with the 3rd base. We use the subseq() method, specifying start = end, to extract only base 3 for each codon. We store each base as a key in the hash %codons and increment the value for that key each time we see that base.

Finally we loop through the keys of %codons, sorting in the order A, C, G, T and push the count for each base onto @output. Then we can print out the sequence ID and the four counts by joining the array elements with a tab. Testing with the coding region of this sequence saved as file orf1.fa gives this output:

nirS    33    298    222    44
ADD COMMENTlink modified 2.2 years ago by _r_am31k • written 10.4 years ago by Neilfws49k
gravatar for brentp
10.4 years ago by
Salt Lake City, UT
brentp23k wrote:

this line:

if($1 = "g"){

should probably be:

if($1 eq "g"){

but you can simplify by doing:

if (/\w\wg/){
    $gc += 1;

Also, I dont have a feel for speed in perl, but it might be faster to just increment a number by 3 and test if substr($seq, $number, 1) eq 'g'

Also note that where your comment says something about printing, you're not actually call print.

ADD COMMENTlink modified 2.2 years ago by _r_am31k • written 10.4 years ago by brentp23k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1450 users visited in the last hour