Hi, everyone,
I need to calculate dinucleotide frequency for many sequences. As i know, the R command count(seq, 2, freq = TRUE)
can calculate only one sequence. I am new in perl. Does anyone has reliable script to complete this task? I downloaded one script (shown below), but when running it, obviously there is error message:
Use of uninitialized value $G[0] in array element at dinucleotide.pl line 31.
Use of uninitialized value $G[0] in multiplication (*) at dinucleotide.pl line 32.
..................
Thank you very much!
#!/usr/bin/perl -w
##
## Display all dinucleotide frequencies along with their expected
## values under the independence hypothesis.
##
## Hashes for translating between bases and numbers
%H = ("A", 0, "T", 1, "G", 2, "C", 3);
%HI = (0, "A", 1, "T", 2, "G", 3, "C");
## The data file comes from the command line.
$file = shift @ARGV;
## Read the data.
open(F, "$file") or die "Unable to open $file\n";
while (<F>)
{
chomp;
$K = length($_);
foreach $j (0..$K-1) { push(@G, $H{substr($_, $j, 1)}); }
}
close(F);
## The number of genes.
$L = scalar(@G);
## Get the marginal and joint frequencies.
foreach $i (0..$L-2)
{
++$M[$G[$i]];
++$J[4 * $G[$i] + $G[$i + 1]];
}
## Normalize.
foreach $j (0..3) { $M[$j] /= ($L - 1); }
foreach $j (0..15) { $J[$j] /= ($L - 1); }
## Display the marginals.
foreach $j (0..3) { print sprintf("%s: %f\n", $HI{$j}, $M[$j]); }
print "\n";
## Display the dinucleotide frequencies.
foreach $j (0..3)
{
foreach $k (0..3)
{
$P = $M[$j] * $M[$k];
$Q = 4 * $j + $k;
print sprintf("%s%s: %f\t%f\n", $HI{$j}, $HI{$k}, $J[$Q], $P);
}
}
I should note that I do not need to calculate each sequence's dinucleotide frequency. I need to get the di-nt frequency from all sequences. Thank you for your attention.
How are you calling the script? Command line and input file format?
problem solved. thank you for comment.
I also get the same problem, how can i resolve it?
I used command
it gives an error :(
Hi, EMBOSS is a good tool. I wrote the following perl script for your reference.
For example, a genome fasta file is as below.
You can run
perl di_nt_freq.pl genome.fa
and get the following.I have to take only one which I will consider? expected or observed?
Hi, harpreetmanku04, you can use the observed value. It's the real di-nucleotide frequency of your input genome.
ok thanku :)
By the way, does this support genome.fa.gz compressed file as input?