Question: Perl Script - For 3Rd Codon G Or C ?
6
gravatar for Ads-Osu
9.0 years ago by
Ads-Osu60
Ads-Osu60 wrote:

So I am trying to write a perl script that gives me the number of Gs or Cs in the third codon position. I have started using a bioperl package but cannot get it to output the name of the sequences in the FASTA.

Also, I have been unsuccessful in extracting the third position thus far. I have got to the point of separating out three base pairs at a time. Here is my code:

#!/usr/bin/perl 
use Bio::SeqIO;
$seqio_obj = Bio::SeqIO->new( -file => "./all_ORFS.fasta" , '-format' => 'Fasta');

 while ($seq_obj = $seqio_obj->next_seq){
    #print the sequence
    $seq1 = $seq_obj->seq,"\n";
    $gc = 0;
    foreach ($seq1){
        @chunk = unpack("A3" x (length($seq1)/3), $seq1);
        while ($chunk){
            if(/\w{2}(\w)/){
                if($1 = "g"){
                    $gc + 1;
                }
            print $gc;
            }
        }
        #print join("\n",@chunk)."\n";    
    }
}
ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.0 years ago by Ads-Osu60

Well, for starters...$1 = "g" is not the same as $1 == 'g'. I'm pretty sure that $gc + 1 is not the right syntax either. And "if $chunk {" isn't going to work either, because I'm pretty sure you haven't initialized $chunk. If you started your script with "use strict" that would be caught.

ADD REPLYlink written 10 months ago by swbarnes26.0k
8
gravatar for Neilfws
9.0 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Good to see that you are using Bioperl for this task. Rule 1: don't reinvent the wheel. Here is one solution. It assumes that your input sequence length is a multiple of 3 and contains only codons, beginning with the start codon.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $fasta = Bio::SeqIO->new(-file => "orf1.fa", -format => "fasta");

while(my $seq = $fasta->next_seq) {
  # should really check here that $seq is as expected
  my @output = $seq->id;
  my %codons;
  for(my $i = 3; $i <= $seq->length; $i+=3) {
    my $codon3 = $seq->subseq($i, $i);
       $codons{$codon3} += 1;
  }
  foreach my $base(sort keys %codons) {
    push(@output, $codons{$base});
  }
  print join("\t", @output), "\n";
}

First, we read in a fasta file (orf1.fa) and loop through each sequence, using the Bio::SeqIO next_seq() method. We get the ID of the sequence (that's the part directly following the ">" in the fasta file) using $seq->id and store it as the first element of array @output.

Next, we loop over the sequence 3 bases at a time, starting with the 3rd base. We use the subseq() method, specifying start = end, to extract only base 3 for each codon. We store each base as a key in the hash %codons and increment the value for that key each time we see that base.

Finally we loop through the keys of %codons, sorting in the order A, C, G, T and push the count for each base onto @output. Then we can print out the sequence ID and the four counts by joining the array elements with a tab. Testing with the coding region of this sequence saved as file orf1.fa gives this output:

nirS    33    298    222    44
ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.0 years ago by Neilfws48k
6
gravatar for brentp
9.0 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

this line:

if($1 = "g"){

should probably be:

if($1 eq "g"){

but you can simplify by doing:

if (/\w\wg/){
    $gc += 1;
 }

Also, I dont have a feel for speed in perl, but it might be faster to just increment a number by 3 and test if substr($seq, $number, 1) eq 'g'

Also note that where your comment says something about printing, you're not actually call print.

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.0 years ago by brentp23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1556 users visited in the last hour