Question: perl script for multifasta sequence analysis
0
gravatar for prosenjit.paul77
5.7 years ago by
India
prosenjit.paul770 wrote:

I have a perl script which gives the sum result of all the sequences present in the multifasta (.txt format) input file. please help me in modifying the code in such a way that it gives result separately for each sequence.

Input sequence:

>seq1
GADD
>seq2
MMGAAL

Observed output:

GA=2
AL= 1
MM= 1
.
.
.
.
.
Sum=4
Abs=19

Expected output:

Seq1:
GA= 1
AL = 0
.
.
.
Sum=1
Abs=21

Seq2:
GA=1
AL=1
MM=1
.
.
.
Sum=3
Abs=19

Code:

#!/usr/bin/perl -w
print "Please type the filename of the DNA sequence MMta: ";
$str= <STDIN>;
chomp $str;
unless ( open(DNAFILE, $str) )
{
print "Cannot open file \"$str\"\n\n";
exit;
}
@str = <DNAFILE>;
$str=join('',@str);
print"\n DNA:\n
$str\n";

$GA = 0;
$AL = 0;
$MM = 0;
$DE = 0;
$DV = 0;
$VD = 0;
$DW = 0;
$QD = 0;
$SD = 0;
$DD = 0;
$ED = 0;
$DY = 0;
$VE = 0;
$EN = 0;
$II = 0;
$KE = 0;
$NV = 0;
$VP = 0;
$FV = 0;
$SS = 0;
$WK = 0;
$KK = 0;
$abs=0;

while ($str =~ /GA/ig)
{$GA++}
while ($str =~ /AL/ig)
{$AL++}
while ($str =~ /MM/ig)
{$MM++}
while ($str =~ /DE/ig)
{$DE++}
while ($str =~ /DV/ig)
{$DV++}
while ($str =~ /VD/ig)
{$VD++}
while ($str =~ /DW/ig)
{$DW++}
while ($str =~ /QD/ig)
{$QD++}
while ($str =~ /SD/ig)
{$SD++}
while ($str =~ /DD/ig)
{$DD++}
while ($str =~ /ED/ig)
{$ED++}
while ($str =~ /DY/ig)
{$DY++}
while ($str =~ /VE/ig)
{$VE++}
while ($str =~ /EN/ig)
{$EN++}
while ($str =~ /II/ig)
{$II++}
while ($str =~ /KE/ig)
{$KE++}
while ($str =~ /NV/ig)
{$NV++}
while ($str =~ /VP/ig)
{$VP++}
while ($str =~ /FV/ig)
{$FV++}
while ($str =~ /SS/ig)
{$SS++}
while ($str =~ /WK/ig)
{$WK++}
while ($str =~ /KK/ig)
{$KK++}
$total= "$GA+$AL+$MM+$DE+$DV+$VD+$DW+$QD+$SD+$DD+$ED+$DY+$VE+$EN+$II+$KE+$NV+$VP+$FV+$SS+$WK+$KK";
while ($total=~ /0/ig)
{$abs++}
$sum= $GA+$AL+$MM+$DE+$DV+$VD+$DW+$QD+$SD+$DD+$ED+$DY+$VE+$EN+$II+$KE+$NV+$VP+$FV+$SS+$WK+$KK;
print "GA = $GA\n";
print "AL = $AL\n";
print "WK = $MM\n";
print "DE = $DE\n";
print "VP = $DV\n";
print "VD = $VD\n";
print "DW = $DW\n";
print "QD = $QD\n";
print "SD = $SD\n";
print "DD = $DD\n";
print "ED = $ED\n";
print "DY = $DY\n";
print "VE = $VE\n";
print "EN = $EN\n";
print "II = $II\n";
print "KE = $KE\n";
print "NV = $NV\n";
print "VP = $VP\n";
print "FV = $FV\n";
print "SS = $SS\n";
print "WK = $WK\n";
print "KK = $KK\n";
print "sum=$sum\n";
print "abs=$abs";

$outputfile = "countbase.txt";
unless ( open(COUNTBASE, ">$outputfile") ) {
print "Cannot open file \"$outputfile\" to write
to!!\n\n";
exit;
}
print COUNTBASE "$GA
$AL
$MM
$DE
$DV
$VD
$DW
$QD
$SD
$DD
$ED
$DY
$VE
$EN
$II
$KE
$NV
$VP
$FV
$SS
$WK
$KK
$sum
$abs";
close(COUNTBASE);


exit;
sequence • 2.5k views
ADD COMMENTlink modified 5 months ago by _r_am32k • written 5.7 years ago by prosenjit.paul770

What do you mean by "sum result of all the sequences"

ADD REPLYlink written 5.7 years ago by geek_y11k
Sum count of the dipeptide i.e GA, AL.... Present in all the sequences
ADD REPLYlink written 5.7 years ago by prosenjit.paul770
3
gravatar for thackl
5.7 years ago by
thackl2.8k
MIT
thackl2.8k wrote:

Because it's weekend. It is fully functional. I hope you can follow the code, otherwise I'm happy to explain... (Fasta parser is based on Fasta-Parser.pl)

#!/usr/bin/env perl
use warnings;
use strict;

# Usage:
#   perl dinuc.pl FILE.FA

my @dipeps_of_interest = qw(
                               GA
                               AL
                               MM
                               DE
                               DV
                               VD
                               DW
                               QD
                               SD
                               DD
                               ED
                               DY
                               VE
                               EN
                               II
                               KE
                               NV
                               VP
                               FV
                               SS
                               WK
                               KK
                       );


open(FASTA, $ARGV[0]) or die $!;

while( my $fa = next_fasta_record(\*FASTA) ){
    my %fa = fasta_record2hash($fa);

    my %dipeps = count_dipeps($fa{seq});
    # or
    # my %dipeps = count_dipeps_fast($fa{seq});

    # report dipeps counts
    print "$fa{id}\n";
    my $sum = 0;
    my $abs = 0;

    foreach (@dipeps_of_interest) {
        if (exists $dipeps{$_}) {
            $sum+= $dipeps{$_};
            print "$_\t$dipeps{$_}\n";
        }else {
            $abs++;
            print "$_\t0\n";            
        }
    }
    print "sum\t$sum\n";
    print "abs\t$abs\n";
}

close FASTA;


# count dipeps of sequence in a hash
sub count_dipeps{
    my $seq = shift;
    my %dipeps;
    while($seq =~ /(?=(\w\w))/g){ # read to chars at a time
        $dipeps{$1}++;
    }
    return %dipeps;
}

# count dipeps of sequence in a hash
# faster than regex and works on seqs >36kbp
sub count_dipeps_fast{
    my $seq = shift;
    my %dipeps;
    my $length = length($seq)-2;
    foreach(unpack("(A2X1)".$length."A2", $seq)){
        $dipeps{$_}++
    }
    return %dipeps;
}

# read a single fasta record from a file handle into a string
sub next_fasta_record{
    my $fh = shift;
    local $/="\n>"; # change record separator form "\n" to "\n>"
    my $fa = <$fh>; # read a fasta record
    return unless defined($fa); # end of file
    chomp($fa); # remove "\n>" from end of record
    $fa =~ s/^>//; # (stupid way to) fix first record in file
    return '>'.$fa;
}

# split a single fasta record into a hash
sub fasta_record2hash{
    my ($head, $seq) = split("\n", $_[0], 2); # split header and seq
    my ($id, $desc) = split(/\s/, $head, 2); # split id and desc
    $seq =~ tr/\n//d; # remove newlines
    my %fa = (
              id => $id,
              desc => $desc,
              seq => $seq,
              );
    return %fa;
}
ADD COMMENTlink modified 5 months ago by _r_am32k • written 5.7 years ago by thackl2.8k

Sir,Thank you for the help.

I placed the input file (file.fa) in the same directory used the exact code as given but it is shwoing the msg

use of uninitialized value in open at C:\users\prosen\desktop\code.pl line 8.
no such file or directory at" C:\users\prosen\desktop\code.pl line 8 "

Line 8 in the code:

open(FASTA, $ARGV[0]) or die $!;

Sir please help me

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by prosenjit.paul770
  • Is that windows?
  • are you using strawberry perl?
  • Did you open your command prompt in the directory where the files were located? Finally did you notice this? perl dinuc.pl FILE.FA
ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by venu6.7k

Yes sir i am using windows system,

No, active per,

yes sir in the same directory,

Renamed the input file byfile.fa

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by prosenjit.paul770

Location and name of the fasta file don't matter, but you need to actually put the path to the fasta file in the command line call to the script:

perl dinuc.pl C:\the\path\to\my\protein.fa

Or at least the name if you first cd to the directory

perl dinuc.pl protein.fa
ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by thackl2.8k

If this does not work, change

open(FASTA, $ARGV[0]) or die $!;

to

open(FASTA, "C:/path/to/my/protein.fa") or die $!;
ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by thackl2.8k

Sir this code is very helpful can be used for DNA and protein both.

But sir i want to count only the 22 dipeptides., their sum and the number of di-peptide absent.......Is there any way to do so...Thanks in advance

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by prosenjit.paul770
1

Okay, still weekend ;). I modified the above code slightly. You can now specify dipeptides of interest, and sum and count are calculated

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by thackl2.8k
1
gravatar for venu
5.7 years ago by
venu6.7k
Germany
venu6.7k wrote:
  • Separate each header and its associated sequence into two separate string ($id, $seq)
  • Count the occurence (GA, AL etc.) with $GA = () = $seq =~ /GA/gi;

Simple modification of this will work

ADD COMMENTlink modified 5 months ago by _r_am32k • written 5.7 years ago by venu6.7k

Sir, i used the reference code but it is showing the msg

use of uninitialized value in open at C:\users\prosen\desktop\code.pl line 7

Line 7 in the code:

open IN, $ARGV[0];

please help.

ADD REPLYlink modified 5 months ago by _r_am32k • written 5.7 years ago by prosenjit.paul770
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2324 users visited in the last hour
_