Question: count word ocurrences in fasta file
0
gravatar for erick_rc93
2.2 years ago by
erick_rc9310
erick_rc9310 wrote:

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
{
        chomp;
        if( $_ =~ />/ )
        {
        if( $total > 0 )
        {
        print "$name $total \n";
        }
        $total = 0; #reset the counter
        $name = $_;
}
else
{
$total = $total + length( $_ ); #add to the length
}
}
close( FASTA );

I saved the above code in a text file named count.pl an run the next code

for file in *.fna; do ./count.pl $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 842 views
ADD COMMENTlink modified 2.2 years ago by JC11k • written 2.2 years ago by erick_rc9310

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLYlink written 2.2 years ago by RamRS30k

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:

>seq1
ATCGCCCCCCCCCCCCCCATCG

The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by h.mon31k
2
gravatar for JC
2.2 years ago by
JC11k
Mexico
JC11k wrote:

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:

#!/usr/bin/perl

use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
}
ADD COMMENTlink written 2.2 years ago by JC11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 712 users visited in the last hour