Question: count word ocurrences in fasta file
gravatar for erick_rc93
14 months ago by
erick_rc930 wrote:

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
        if( $_ =~ />/ )
        if( $total > 0 )
        print "$name $total \n";
        $total = 0; #reset the counter
        $name = $_;
$total = $total + length( $_ ); #add to the length
close( FASTA );

I saved the above code in a text file named an run the next code

for file in *.fna; do ./ $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 527 views
ADD COMMENTlink modified 14 months ago by JC8.7k • written 14 months ago by erick_rc930

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLYlink written 14 months ago by RamRS24k

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:


The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLYlink modified 14 months ago • written 14 months ago by h.mon27k
gravatar for JC
14 months ago by
JC8.7k wrote:

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:


use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
ADD COMMENTlink written 14 months ago by JC8.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2017 users visited in the last hour