GC percent script; reading a fasta formatted data
2
0
Entering edit mode
6.1 years ago
ggman ▴ 80

Hi all. I have a question regarding a perl script I have already made. I need it to be able to read fasta formatted files with multiple sequences. My script below can tell me the GC percent for one whole file not taking into account the headers sadly, needed some advice on going in the right direction (somewhat new to scripting).

 

#!/usr/bin/perl
use strict;
use warnings;
my $DNA_filename;
my @DNA;
my $cg;
my $total;
my $e;
my $DNA;
my $num;
my $outputfile;
print "type the filename of the DNA sequence data: ";
$DNA_filename = <STDIN>;
chomp $DNA_filename;
unless    ( open(DNAFILE, $DNA_filename) ) {
    print "Cannot open file \"$DNA_filename\"\n\n";
    exit;
}
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join( '', @DNA);
$DNA =~ s/\s//g;
$cg = 0;$total = 0;$e = 0;
while ($DNA =~ /c/ig){$cg++}
while ($DNA =~ /g/ig){$cg++}
while ($DNA =~ /a/ig){$total++}
while ($DNA =~ /c/ig){$total++}
while ($DNA =~ /g/ig){$total++}
while ($DNA =~ /t/ig){$total++}
while ($DNA =~ /[^acgt]/ig){$e++}
print "CG=$cg total nucleotides = $total errors=$e\n";
$outputfile = "percent cg";
unless ( open(PERCENT_CG, ">$outputfile") ) {
    print "Cannot open file \"$outputfile\"\n\n";
    exit;
}
print PERCENT_CG "CG =$cg total nucleotides=$total errors=$e\n\n";
$num = ($cg / $total) *100 ;
print "$num\n\n";
close(PERCENT_CG);
exit;
alignment sequencing • 3.2k views
ADD COMMENT
0
Entering edit mode

I haven't tested any of this code for functionality but it should be close enough to get you started.

  1. Takes Ram's advice to heart.
  2. If you truly want to roll your own there's two options you can take a few routes. I'd suggest you do one of two things:

Break your GC content calulator into a sub routine and read through your files getting one record at a time. once you have a complete record in , pass it to your GC calculator. Something like this:

open (IN , '<' , file.fas) || die $!;

my $header = '';
my $seq = '';
while (my $line = <IN>){
    chomp $line;
    if ($line =~/^>/){
        my $gc = calc_gc($seq) if ($seq);
        $seq = '';
        $header = '';
        }
    else{
        $seq .= $line;
        }

    }

exit;

sub calc_gc{
    my $seq = shift;
    my $GC = 0;
    my $total = length($seq);
    while ($seq=~/G|C/ig){
        $GC++;
        }
    return $GC/$total;
    }

OR you can leverage a special variable in perl $\ , Go read perl var for more infor but something like this should get you started:

my $term = $\;
$\ = '>';

open (IN , '<' , file.fas) || die $!;

my @records = <IN>;

map {
    calc_gc($_) if $_;
    }@records;

exit;

sub calc_gc{
    my $seq = shift;
    my $GC = 0;
    my $total = length($seq);
    while ($seq=~/G|C/ig){
        $GC++;
        }
    return $GC/$total;
    }
ADD REPLY
3
Entering edit mode
6.1 years ago

BBTools has a program which will do this rapidly:

stats.sh in=file.fasta

It will also give you other statistics, which you can ignore, but they're very handy for evaluating assemblies. Or you can use countgc.sh which won't give the other statistics.

Stats is kind of a neat program in that it will run fast on even gigantic fasta files with trillions of bases and millions of contigs using only a tiny amount of RAM. That's easy for gc-counting, but a bit harder for general assembly stats like L50/N50.

ADD COMMENT
3
Entering edit mode
6.1 years ago
Ram 32k
  • Don't reinvent the wheel, look for established tools that do the GC percent analysis for you.
  • You're looping over the sequence 6 times for something that can be done in 2-3 steps
  • You should ideally use Bio::Seq::IO, it takes care of niche problems you may not foresee
  • If you're manually reading input as lines, always chomp the lines (this is an example of the problem BioPerl will take care of)
  • Not sure if the regex match operator you're using will give you any useful result - owing to the fact that you're trying to match a,t,g and c once with every sequence. All sequences will contain at least one of a,t,g and c each, and the count will always be 1.

How to solve your problems?

Check for existing tools, or use BioPerl, or use simple UNIX commands, or use substitute (not match) to find gc count. E.g.: len(upper("CAcaTg"))-len(sub(upper("CAcaTg"),"C","")) will give you number of 'C's in the sequence.

Good luck!

EDIT: Updated with more details and explanations

ADD COMMENT
0
Entering edit mode

Will do, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2440 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6