Question: GC percent script; reading a fasta formatted data
0
gravatar for ggman
4.2 years ago by
ggman80
United States
ggman80 wrote:

Hi all. I have a question regarding a perl script I have already made. I need it to be able to read fasta formatted files with multiple sequences. My script below can tell me the GC percent for one whole file not taking into account the headers sadly, needed some advice on going in the right direction (somewhat new to scripting).

 

#!/usr/bin/perl
use strict;
use warnings;
my $DNA_filename;
my @DNA;
my $cg;
my $total;
my $e;
my $DNA;
my $num;
my $outputfile;
print "type the filename of the DNA sequence data: ";
$DNA_filename = <STDIN>;
chomp $DNA_filename;
unless    ( open(DNAFILE, $DNA_filename) ) {
    print "Cannot open file \"$DNA_filename\"\n\n";
    exit;
}
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join( '', @DNA);
$DNA =~ s/\s//g;
$cg = 0;$total = 0;$e = 0;
while ($DNA =~ /c/ig){$cg++}
while ($DNA =~ /g/ig){$cg++}
while ($DNA =~ /a/ig){$total++}
while ($DNA =~ /c/ig){$total++}
while ($DNA =~ /g/ig){$total++}
while ($DNA =~ /t/ig){$total++}
while ($DNA =~ /[^acgt]/ig){$e++}
print "CG=$cg total nucleotides = $total errors=$e\n";
$outputfile = "percent cg";
unless ( open(PERCENT_CG, ">$outputfile") ) {
    print "Cannot open file \"$outputfile\"\n\n";
    exit;
}
print PERCENT_CG "CG =$cg total nucleotides=$total errors=$e\n\n";
$num = ($cg / $total) *100 ;
print "$num\n\n";
close(PERCENT_CG);
exit;
sequencing alignment • 2.3k views
ADD COMMENTlink modified 4.2 years ago by Brian Bushnell16k • written 4.2 years ago by ggman80

I haven't tested any of this code for functionality but it should be close enough to get you started.

  1. Takes Ram's advice to heart.
  2. If you truly want to roll your own there's two options you can take a few routes. I'd suggest you do one of two things:

Break your GC content calulator into a sub routine and read through your files getting one record at a time. once you have a complete record in , pass it to your GC calculator. Something like this:

open (IN , '<' , file.fas) || die $!;

my $header = '';
my $seq = '';
while (my $line = <IN>){
    chomp $line;
    if ($line =~/^>/){
        my $gc = calc_gc($seq) if ($seq);
        $seq = '';
        $header = '';
        }
    else{
        $seq .= $line;
        }

    }

exit;

sub calc_gc{
    my $seq = shift;
    my $GC = 0;
    my $total = length($seq);
    while ($seq=~/G|C/ig){
        $GC++;
        }
    return $GC/$total;
    }

OR you can leverage a special variable in perl $\ , Go read perl var for more infor but something like this should get you started:

my $term = $\;
$\ = '>';

open (IN , '<' , file.fas) || die $!;

my @records = <IN>;

map {
    calc_gc($_) if $_;
    }@records;

exit;

sub calc_gc{
    my $seq = shift;
    my $GC = 0;
    my $total = length($seq);
    while ($seq=~/G|C/ig){
        $GC++;
        }
    return $GC/$total;
    }
ADD REPLYlink modified 9 months ago by RamRS22k • written 4.2 years ago by dylan.storey60
3
gravatar for Brian Bushnell
4.2 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBTools has a program which will do this rapidly:

stats.sh in=file.fasta

It will also give you other statistics, which you can ignore, but they're very handy for evaluating assemblies. Or you can use countgc.sh which won't give the other statistics.

Stats is kind of a neat program in that it will run fast on even gigantic fasta files with trillions of bases and millions of contigs using only a tiny amount of RAM. That's easy for gc-counting, but a bit harder for general assembly stats like L50/N50.

ADD COMMENTlink modified 9 months ago by RamRS22k • written 4.2 years ago by Brian Bushnell16k
3
gravatar for RamRS
4.2 years ago by
RamRS22k
Houston, TX
RamRS22k wrote:
  • Don't reinvent the wheel, look for established tools that do the GC percent analysis for you.
  • You're looping over the sequence 6 times for something that can be done in 2-3 steps
  • You should ideally use Bio::Seq::IO, it takes care of niche problems you may not foresee
  • If you're manually reading input as lines, always chomp the lines (this is an example of the problem BioPerl will take care of)
  • Not sure if the regex match operator you're using will give you any useful result - owing to the fact that you're trying to match a,t,g and c once with every sequence. All sequences will contain at least one of a,t,g and c each, and the count will always be 1.

How to solve your problems?

Check for existing tools, or use BioPerl, or use simple UNIX commands, or use substitute (not match) to find gc count. E.g.: len(upper("CAcaTg"))-len(sub(upper("CAcaTg"),"C","")) will give you number of 'C's in the sequence.

Good luck!

EDIT: Updated with more details and explanations

ADD COMMENTlink modified 9 months ago • written 4.2 years ago by RamRS22k

Will do, thanks!

ADD REPLYlink written 4.2 years ago by ggman80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 952 users visited in the last hour