Help With Counting Script
4
0
Entering edit mode
12.4 years ago
Confused ▴ 50

I am very new to bioperl and am trying to write a script that counts the number of sequences, number of characters, %GC content and looks for leucine zipper motifs if possible.

Here is what I have come up with thus far

#!/usr/bin/perl -w

use Bio::SeqIO;
my $seqfile = "t4.fasta" ;
my $in = Bio::SeqIO->new(-format=>'fasta',
                         -file=> $seqfile );

my $count = 0;
while ( my $seq = $in->next_seq ) {
}
print "There are $count sequences\n";

I was wondering if anyone could give me some pointers on how to get the %GC content and how to find the motif?

Thanks a million!

counts gc motif bioperl • 5.1k views
ADD COMMENT
5
Entering edit mode
12.4 years ago

It might not be exactly what you wanted. But here is a quick and dirty python script to get GC content, total base pairs, and number of sequences:

import sys

inFile = open(sys.argv[1],'r')

totalBP = 0
gcBP = 0
headerCount = 0
for line in inFile:
    if line[0] == ">":
        headerCount += 1
    else:
        seqLine = line.strip().lower()
        totalBP += len(seqLine)
        gcBP += seqLine.count('g')
        gcBP += seqLine.count('c')

print 'number of sequences: ' + str(headerCount)
print 'total base pairs: ' + str(totalBP)
print 'gc content: ' + str(float(gcBP) / totalBP)

Save it as YourName.py and run it by: python YourName.py mySequences.fa

ADD COMMENT
0
Entering edit mode

Sorry to drag up such an old thread, but I've just found this as I need to do the same thing! I don't have much experience with python, but a specific question here: if you passed this script a multifasta, would the totalBP counter reflect the totalBP of the WHOLE multifasta, or would it count each 'sub-fasta' when calculating the gc content of each? I realise this is somewhat academic because you could just as easily pass this each fasta in sequence in a loop to be sure but I thought I'd ask.

ADD REPLY
0
Entering edit mode

the scripts runs on the entirety of the file and produces the total counts for all sequences

ADD REPLY
2
Entering edit mode
12.4 years ago
Andrew W ▴ 290

my $na = $seq->seq; my $len = length $na; $totchrs += $len; # declare $totchrs outside your while loop my $gc = $na =~ tr/gcGC//; my $pergc = $gc/ $len * 100; $pergc = sprintf "%3.2f", $per_gc; # if you want it formatted

Or you can look at this script:

https://github.com/bioperl/bioperl-live/blob/master/scripts/seqstats/bp_gccalc.pl

As for finding the motif count, perhaps this is useful:

http://www.bioperl.org/wiki/FAQ#How_do_I_do_motif_searches_with_BioPerl.3F_Can_I_do_.22find_all_sequences_that_are_75.25_identical.22_to_a_given_motif.3F

ADD COMMENT
0
Entering edit mode
12.2 years ago
Woa ★ 2.9k

Alternatively for counting how many sequences, you can use this:

use Bio::SeqIO;
my $fh = Bio::SeqIO->newFh(-file=>"myfile.fasta",-format=>'Fasta');
my @a= <$fh>;
print scalar @a;
ADD COMMENT

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6