Question: count word ocurrences in fasta file
gravatar for erick_rc93
2.2 years ago by
erick_rc9310 wrote:

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
        if( $_ =~ />/ )
        if( $total > 0 )
        print "$name $total \n";
        $total = 0; #reset the counter
        $name = $_;
$total = $total + length( $_ ); #add to the length
close( FASTA );

I saved the above code in a text file named an run the next code

for file in *.fna; do ./ $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 842 views
ADD COMMENTlink modified 2.2 years ago by JC11k • written 2.2 years ago by erick_rc9310

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLYlink written 2.2 years ago by RamRS30k

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:


The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by h.mon31k
gravatar for JC
2.2 years ago by
JC11k wrote:

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:


use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
ADD COMMENTlink written 2.2 years ago by JC11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 712 users visited in the last hour