Question: Script To Calculate Dinucleotide Frequency For Many Sequences
1
gravatar for biolab
5.4 years ago by
biolab1.1k
biolab1.1k wrote:

Hi, everyone,

I need to calculate dinucleotide frequency for many sequences. As i know, the R command count(seq, 2, freq = TRUE) can calculate only one sequence. I am new in perl. Does anyone has reliable script to complete this task? I downloaded one script (shown below), but when running it, obviously there is error message:

Use of uninitialized value $G[0] in array element at dinucleotide.pl line 31.
Use of uninitialized value $G[0] in multiplication (*) at dinucleotide.pl line 32.
..................

Thank you very much!

#!/usr/bin/perl -w

##
## Display all dinucleotide frequencies along with their expected
## values under the independence hypothesis.
##

##  Hashes for translating between bases and numbers
%H = ("A", 0, "T", 1, "G", 2, "C", 3);
%HI = (0, "A", 1, "T", 2, "G", 3, "C");

## The data file comes from the command line.
$file = shift @ARGV;

## Read the data.
open(F, "$file") or die "Unable to open $file\n";
while (<F>)
{
    chomp;
    $K = length($_);
    foreach $j (0..$K-1) { push(@G, $H{substr($_, $j, 1)}); }
}
close(F);

## The number of genes.
$L = scalar(@G);

## Get the marginal and joint frequencies.
foreach $i (0..$L-2)
{
    ++$M[$G[$i]];
    ++$J[4 * $G[$i] + $G[$i + 1]];
}

## Normalize.
foreach $j (0..3) { $M[$j] /= ($L - 1); }
foreach $j (0..15) { $J[$j] /= ($L - 1); }

## Display the marginals.
foreach $j (0..3) { print sprintf("%s: %f\n", $HI{$j}, $M[$j]); }
print "\n";

## Display the dinucleotide frequencies.
foreach $j (0..3)
{
    foreach $k (0..3)
    {
    $P = $M[$j] * $M[$k];
    $Q = 4 * $j + $k;
    print sprintf("%s%s: %f\t%f\n", $HI{$j}, $HI{$k}, $J[$Q], $P);
    }
}
perl frequency • 7.8k views
ADD COMMENTlink modified 5.4 years ago by Pavel Senin1.9k • written 5.4 years ago by biolab1.1k

I should note that I do not need to calculate each sequence's dinucleotide frequency. I need to get the di-nt frequency from all sequences. Thank you for your attention.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by biolab1.1k

How are you calling the script? Command line and input file format?

ADD REPLYlink written 5.4 years ago by vodka80

problem solved. thank you for comment.

ADD REPLYlink written 5.4 years ago by biolab1.1k

i also get the same problem, how can i resolve it? 

i used command

perl dinucleotide.pl  -in utr.fasta  > out 

it gives an error :(

ADD REPLYlink written 3.5 years ago by harpreetmanku0410

Hi, EMBOSS is a good tool.  I wrote the following perl script for your reference.

#!/usr/bin/perl -w
use strict;

my ($genome, $head, $tail);
my (%mono_nt, %di_nt);

$/ = ">";
open my $fasta, '<', $ARGV[0] or die $!;
while (<$fasta>) {
    chomp; s/\r//g; s/^\s*|\s*$//;
    if (/.+?\n(.+)/s) {
        (my $seq = $1) =~ s/\n//g;
        $genome .= uc $seq;
        $head = uc substr($seq, 0, 1);
        $di_nt{"$tail$head"}-- if $tail;
        $tail = uc substr($seq, -1);
    }
}
close $fasta;

my $len = length $genome;
for my $i (0..$len-2) {
    my $each_mono_nt = substr($genome, $i, 1);
    my $each_di_nt   = substr($genome, $i, 2);
    $mono_nt{$each_mono_nt}++;
    $di_nt{$each_di_nt}++;
}
$mono_nt{$tail}++;

print "-"x30, "\nSingle nucleotide frequency:\n";
for my $nt (sort keys %mono_nt) {print "$nt\t", $mono_nt{$nt} / $len, "\n";}

print "\n", "-"x30, "\nDinucleotide frequency:\nDinucleotide\tObs. freq.\tExp. freq.\n";
for my $nt_pair (sort keys %di_nt) {
    my ($first_nt, $second_nt) = split //, $nt_pair;
    print "$nt_pair\t", $di_nt{$nt_pair} / ($len-1), "\t",
        $mono_nt{$first_nt} * $mono_nt{$second_nt} /$len /$len, "\n";
}

 

For example, a genome fasta file is as below.

>Chr1
ATGCTAG
>Chr2
aTGGcCA

You can run  perl di_nt_freq.pl  genome.fa  and get the following.

------------------------------
Single nucleotide frequency:
A       0.285714285714286
C       0.214285714285714
G       0.285714285714286
T       0.214285714285714

------------------------------
Dinucleotide frequency:
Dinucleotide    Obs. freq.      Exp. freq.
AG      0.0769230769230769      0.0816326530612245
AT      0.153846153846154       0.0612244897959184
CA      0.0769230769230769      0.0612244897959184
CC      0.0769230769230769      0.0459183673469388
CT      0.0769230769230769      0.0459183673469388
GA      0       0.0816326530612245
GC      0.153846153846154       0.0612244897959184
GG      0.0769230769230769      0.0816326530612245
TA      0.0769230769230769      0.0612244897959184
TG      0.153846153846154       0.0612244897959184
ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by biolab1.1k

i have to take only one which i will consider ? expercted or observed?

ADD REPLYlink written 3.5 years ago by harpreetmanku0410

Hi, harpreetmanku04, you can use the observed value. It's the real di-nucleotide frequency of your input genome.

ADD REPLYlink written 3.5 years ago by biolab1.1k

ok thanku :)

ADD REPLYlink written 3.5 years ago by harpreetmanku0410
5
gravatar for Pavel Senin
5.4 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

Just a suggestion, you can solve this (or test your script results) by using EMBOSS command line utility compseq:

$ compseq
Calculate the composition of unique words in sequences
Input sequence(s): test.fa
Word size to consider (e.g. 2=dimer) [2]: 
Program compseq output file [test.composition]: test.composition

the output:

$ less test.composition
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       FM3R2-A1
#       FM3R2-B1
# ... et al.


Word size       2
Total count     41749

#
# Word  Obs Count       Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AA      3207            0.0768162       0.0625000       1.2290594
AC      2377            0.0569355       0.0625000       0.9109679
AG      3122            0.0747802       0.0625000       1.1964837
AT      2053            0.0491748       0.0625000       0.7867973
CA      1878            0.0449831       0.0625000       0.7197298
CC      1999            0.0478814       0.0625000       0.7661022
CG      2848            0.0682172       0.0625000       1.0914752
CT      2155            0.0516180       0.0625000       0.8258880
GA      3234            0.0774629       0.0625000       1.2394069
GC      2925            0.0700616       0.0625000       1.1209849
GG      4474            0.1071642       0.0625000       1.7146279
GT      2810            0.0673070       0.0625000       1.0769120
TA      2434            0.0583008       0.0625000       0.9328128
TC      1580            0.0378452       0.0625000       0.6055235
TG      3006            0.0720017       0.0625000       1.1520276
TT      1647            0.0394500       0.0625000       0.6312007

Other   0               0.0000000       0.0000000       10000000000.0000000
ADD COMMENTlink written 5.4 years ago by Pavel Senin1.9k

Hi, Pavel,Thank you for your suggestions, which are really helpful. I previously used a fasta file containing multiple sequences for input. That generate error message. Cheers.

ADD REPLYlink written 5.4 years ago by biolab1.1k

hello do we need to install any software to run this?

ADD REPLYlink written 3.5 years ago by harpreetmanku0410
2
gravatar for Csaba Kerepesi
5.4 years ago by
Hungary
Csaba Kerepesi330 wrote:

Hi!

Delete description line(s) of the sequence file. Input file should contain only characters A,T,G,C.

I have tried and the script works without warning messages:

~/perl$ perl dinucleotid-freq.perl ecolidarab
A: 0.237818
T: 0.241783
G: 0.269439
C: 0.250960

AA: 0.069548    0.056557
AT: 0.063013    0.057500
AG: 0.051856    0.064077
AC: 0.053400    0.059683
TA: 0.043319    0.057500
TT: 0.068591    0.058459
TG: 0.074130    0.065146
TC: 0.055743    0.060678
GA: 0.060401    0.064077
GT: 0.058343    0.065146
GG: 0.065288    0.072597
GC: 0.085408    0.067618
CA: 0.064550    0.059683
CT: 0.051837    0.060678
CG: 0.078165    0.067618
CC: 0.056409    0.062981
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Csaba Kerepesi330
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1622 users visited in the last hour