Question: Script To Calculate Dinucleotide Frequency For Many Sequences
1
biolab1.2k wrote:

Hi, everyone,

I need to calculate dinucleotide frequency for many sequences. As i know, the R command `count(seq, 2, freq = TRUE)` can calculate only one sequence. I am new in perl. Does anyone has reliable script to complete this task? I downloaded one script (shown below), but when running it, obviously there is error message:

``````Use of uninitialized value \$G in array element at dinucleotide.pl line 31.
Use of uninitialized value \$G in multiplication (*) at dinucleotide.pl line 32.
..................
``````

Thank you very much!

``````#!/usr/bin/perl -w

##
## Display all dinucleotide frequencies along with their expected
## values under the independence hypothesis.
##

##  Hashes for translating between bases and numbers
%H = ("A", 0, "T", 1, "G", 2, "C", 3);
%HI = (0, "A", 1, "T", 2, "G", 3, "C");

## The data file comes from the command line.
\$file = shift @ARGV;

open(F, "\$file") or die "Unable to open \$file\n";
while (<F>)
{
chomp;
\$K = length(\$_);
foreach \$j (0..\$K-1) { push(@G, \$H{substr(\$_, \$j, 1)}); }
}
close(F);

## The number of genes.
\$L = scalar(@G);

## Get the marginal and joint frequencies.
foreach \$i (0..\$L-2)
{
++\$M[\$G[\$i]];
++\$J[4 * \$G[\$i] + \$G[\$i + 1]];
}

## Normalize.
foreach \$j (0..3) { \$M[\$j] /= (\$L - 1); }
foreach \$j (0..15) { \$J[\$j] /= (\$L - 1); }

## Display the marginals.
foreach \$j (0..3) { print sprintf("%s: %f\n", \$HI{\$j}, \$M[\$j]); }
print "\n";

## Display the dinucleotide frequencies.
foreach \$j (0..3)
{
foreach \$k (0..3)
{
\$P = \$M[\$j] * \$M[\$k];
\$Q = 4 * \$j + \$k;
print sprintf("%s%s: %f\t%f\n", \$HI{\$j}, \$HI{\$k}, \$J[\$Q], \$P);
}
}
``````
perl frequency • 9.7k views
modified 6.9 years ago by Pavel Senin1.9k • written 6.9 years ago by biolab1.2k

I should note that I do not need to calculate each sequence's dinucleotide frequency. I need to get the di-nt frequency from all sequences. Thank you for your attention.

How are you calling the script? Command line and input file format?

problem solved. thank you for comment.

I also get the same problem, how can i resolve it?

I used command

``````perl dinucleotide.pl -in utr.fasta > out
``````

it gives an error :(

Hi, EMBOSS is a good tool. I wrote the following perl script for your reference.

``````#!/usr/bin/perl -w
use strict;

my (%mono_nt, %di_nt);

\$/ = ">";
open my \$fasta, '<', \$ARGV or die \$!;
while (<\$fasta>) {
chomp; s/\r//g; s/^\s*|\s*\$//;
if (/.+?\n(.+)/s) {
(my \$seq = \$1) =~ s/\n//g;
\$genome .= uc \$seq;
\$head = uc substr(\$seq, 0, 1);
\$tail = uc substr(\$seq, -1);
}
}
close \$fasta;

my \$len = length \$genome;
for my \$i (0..\$len-2) {
my \$each_mono_nt = substr(\$genome, \$i, 1);
my \$each_di_nt   = substr(\$genome, \$i, 2);
\$mono_nt{\$each_mono_nt}++;
\$di_nt{\$each_di_nt}++;
}
\$mono_nt{\$tail}++;

print "-"x30, "\nSingle nucleotide frequency:\n";
for my \$nt (sort keys %mono_nt) {print "\$nt\t", \$mono_nt{\$nt} / \$len, "\n";}

print "\n", "-"x30, "\nDinucleotide frequency:\nDinucleotide\tObs. freq.\tExp. freq.\n";
for my \$nt_pair (sort keys %di_nt) {
my (\$first_nt, \$second_nt) = split //, \$nt_pair;
print "\$nt_pair\t", \$di_nt{\$nt_pair} / (\$len-1), "\t",
\$mono_nt{\$first_nt} * \$mono_nt{\$second_nt} /\$len /\$len, "\n";
}
``````

For example, a genome fasta file is as below.

``````>Chr1
ATGCTAG
>Chr2
aTGGcCA
``````

You can run `perl di_nt_freq.pl genome.fa` and get the following.

``````------------------------------
Single nucleotide frequency:
A       0.285714285714286
C       0.214285714285714
G       0.285714285714286
T       0.214285714285714

------------------------------
Dinucleotide frequency:
Dinucleotide    Obs. freq.      Exp. freq.
AG      0.0769230769230769      0.0816326530612245
AT      0.153846153846154       0.0612244897959184
CA      0.0769230769230769      0.0612244897959184
CC      0.0769230769230769      0.0459183673469388
CT      0.0769230769230769      0.0459183673469388
GA      0       0.0816326530612245
GC      0.153846153846154       0.0612244897959184
GG      0.0769230769230769      0.0816326530612245
TA      0.0769230769230769      0.0612244897959184
TG      0.153846153846154       0.0612244897959184
``````

I have to take only one which I will consider? expected or observed?

Hi, harpreetmanku04, you can use the observed value. It's the real di-nucleotide frequency of your input genome.

ok thanku :)

5
Pavel Senin1.9k wrote:

Just a suggestion, you can solve this (or test your script results) by using EMBOSS command line utility compseq:

``````\$ compseq
Calculate the composition of unique words in sequences
Input sequence(s): test.fa
Word size to consider (e.g. 2=dimer) :
Program compseq output file [test.composition]: test.composition
``````

the output:

``````\$ less test.composition
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       FM3R2-A1
#       FM3R2-B1
# ... et al.

Word size       2
Total count     41749

#
# Word  Obs Count       Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AA      3207            0.0768162       0.0625000       1.2290594
AC      2377            0.0569355       0.0625000       0.9109679
AG      3122            0.0747802       0.0625000       1.1964837
AT      2053            0.0491748       0.0625000       0.7867973
CA      1878            0.0449831       0.0625000       0.7197298
CC      1999            0.0478814       0.0625000       0.7661022
CG      2848            0.0682172       0.0625000       1.0914752
CT      2155            0.0516180       0.0625000       0.8258880
GA      3234            0.0774629       0.0625000       1.2394069
GC      2925            0.0700616       0.0625000       1.1209849
GG      4474            0.1071642       0.0625000       1.7146279
GT      2810            0.0673070       0.0625000       1.0769120
TA      2434            0.0583008       0.0625000       0.9328128
TC      1580            0.0378452       0.0625000       0.6055235
TG      3006            0.0720017       0.0625000       1.1520276
TT      1647            0.0394500       0.0625000       0.6312007

Other   0               0.0000000       0.0000000       10000000000.0000000
``````

Hi, Pavel,Thank you for your suggestions, which are really helpful. I previously used a fasta file containing multiple sequences for input. That generate error message. Cheers.

hello do we need to install any software to run this?

2
Csaba Kerepesi330 wrote:

Hi!

Delete description line(s) of the sequence file. Input file should contain only characters A,T,G,C.

I have tried and the script works without warning messages:

``````~/perl\$ perl dinucleotid-freq.perl ecolidarab
A: 0.237818
T: 0.241783
G: 0.269439
C: 0.250960

AA: 0.069548    0.056557
AT: 0.063013    0.057500
AG: 0.051856    0.064077
AC: 0.053400    0.059683
TA: 0.043319    0.057500
TT: 0.068591    0.058459
TG: 0.074130    0.065146
TC: 0.055743    0.060678
GA: 0.060401    0.064077
GT: 0.058343    0.065146
GG: 0.065288    0.072597
GC: 0.085408    0.067618
CA: 0.064550    0.059683
CT: 0.051837    0.060678
CG: 0.078165    0.067618
CC: 0.056409    0.062981
``````