Question: count word ocurrences in fasta file
0
gravatar for erick_rc93
8 months ago by
erick_rc930
erick_rc930 wrote:

I would like to do a perl script which can count all word ocurrences in all multifasta files in a directory and the output looks like

Number of occurrences in >name_sequence is : #

I have tried with

for file in *.fasta; do perl -lne 'if($_ =~ /word/) { $a++;} ; END { print "No of words in the file:"; print $a}' $file; done

I also would like that the script count the sequences length, I put the next script in a for loop in shell

#!/usr/bin/perl -w

use strict;
open( FASTA, $ARGV[0] ) or die "Cannot open fasta file: $!\n";
my $total = 0; #create a counter set to zero
my $name = ''; #to record the name of the current sequence
while( <FASTA> )
{
        chomp;
        if( $_ =~ />/ )
        {
        if( $total > 0 )
        {
        print "$name $total \n";
        }
        $total = 0; #reset the counter
        $name = $_;
}
else
{
$total = $total + length( $_ ); #add to the length
}
}
close( FASTA );

I saved the above code in a text file named count.pl an run the next code

for file in *.fna; do ./count.pl $file; done

I'd like to do these two task in one alone, but I don't know how to do it.

sequence • 366 views
ADD COMMENTlink modified 8 months ago by JC7.6k • written 8 months ago by erick_rc930

Looks like an assignment exercise. Is it one, erick_rc93?

ADD REPLYlink written 8 months ago by RamRS20k

Your Perl one-liner doesn't really count the number of occurrences of word, it counts the numbers of lines where word occurred. Test example, save it as test.fasta:

>seq1
ATCGCCCCCCCCCCCCCCATCG

The pattern ATCG occurs twice, but it will be counted only once.

ADD REPLYlink modified 8 months ago • written 8 months ago by h.mon24k
2
gravatar for JC
8 months ago by
JC7.6k
Mexico
JC7.6k wrote:

There are several ways to do this, also there are unknown conditions, do you need all words including overlapped words? Do you care for words in the reverse-complement chain? The simplest solution I can think is to split your sequences and count array elements:

#!/usr/bin/perl

use strict;
use warnings;

my $word = "ATG"; # here you define the word to count, non-overlaps in forward chain are considered only
$/ = "\n>"; # reads sequences in Fasta by blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence in one single string
    my @subseq = split (/$word/, $seq);
    print "sequence $id has $#subseq occurences of $word\n";
}
ADD COMMENTlink written 8 months ago by JC7.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1072 users visited in the last hour