Question: Help With Counting Script
0
gravatar for Confused
8.1 years ago by
Confused50
Confused50 wrote:

I am very new to bioperl and am trying to write a script that counts the number of sequences, number of characters, %GC content and looks for leucine zipper motifs if possible.

Here is what I have come up with thus far

#!/usr/bin/perl -w

use Bio::SeqIO;
my $seqfile = "t4.fasta" ;
my $in = Bio::SeqIO->new(-format=>'fasta',
                         -file=> $seqfile );

my $count = 0;
while ( my $seq = $in->next_seq ) {
}
print "There are $count sequences\n";

I was wondering if anyone could give me some pointers on how to get the %GC content and how to find the motif?

Thanks a million!

gc bioperl motif counts • 3.9k views
ADD COMMENTlink modified 7.9 years ago by Andrew W290 • written 8.1 years ago by Confused50
5
gravatar for Damian Kao
8.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

It might not be exactly what you wanted. But here is a quick and dirty python script to get GC content, total base pairs, and number of sequences:

import sys

inFile = open(sys.argv[1],'r')

totalBP = 0
gcBP = 0
headerCount = 0
for line in inFile:
    if line[0] == ">":
        headerCount += 1
    else:
        seqLine = line.strip().lower()
        totalBP += len(seqLine)
        gcBP += seqLine.count('g')
        gcBP += seqLine.count('c')

print 'number of sequences: ' + str(headerCount)
print 'total base pairs: ' + str(totalBP)
print 'gc content: ' + str(float(gcBP) / totalBP)

Save it as YourName.py and run it by: python YourName.py mySequences.fa

ADD COMMENTlink written 8.1 years ago by Damian Kao15k

Sorry to drag up such an old thread, but I've just found this as I need to do the same thing! I don't have much experience with python, but a specific question here: if you passed this script a multifasta, would the totalBP counter reflect the totalBP of the WHOLE multifasta, or would it count each 'sub-fasta' when calculating the gc content of each? I realise this is somewhat academic because you could just as easily pass this each fasta in sequence in a loop to be sure but I thought I'd ask.

ADD REPLYlink written 3.7 years ago by Joe15k

the scripts runs on the entirety of the file and produces the total counts for all sequences

ADD REPLYlink written 3.7 years ago by Istvan Albert ♦♦ 82k
2
gravatar for Andrew W
8.1 years ago by
Andrew W290
Andrew W290 wrote:

my $na = $seq->seq; my $len = length $na; $totchrs += $len; # declare $totchrs outside your while loop my $gc = $na =~ tr/gcGC//; my $pergc = $gc/ $len * 100; $pergc = sprintf "%3.2f", $per_gc; # if you want it formatted

Or you can look at this script:

https://github.com/bioperl/bioperl-live/blob/master/scripts/seqstats/bp_gccalc.pl

As for finding the motif count, perhaps this is useful:

http://www.bioperl.org/wiki/FAQ#How_do_I_do_motif_searches_with_BioPerl.3F_Can_I_do_.22find_all_sequences_that_are_75.25_identical.22_to_a_given_motif.3F

ADD COMMENTlink modified 7.5 years ago by Istvan Albert ♦♦ 82k • written 8.1 years ago by Andrew W290
0
gravatar for Woa
7.9 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Alternatively for counting how many sequences, you can use this:

use Bio::SeqIO;
my $fh = Bio::SeqIO->newFh(-file=>"myfile.fasta",-format=>'Fasta');
my @a= <$fh>;
print scalar @a;
ADD COMMENTlink written 7.9 years ago by Woa2.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 896 users visited in the last hour