Question

Script To Calculate Dinucleotide Frequency For Many Sequences

2

Entering edit mode

10.3 years ago

biolab ★ 1.4k

Hi, everyone,

I need to calculate dinucleotide frequency for many sequences. As i know, the R command count(seq, 2, freq = TRUE) can calculate only one sequence. I am new in perl. Does anyone has reliable script to complete this task? I downloaded one script (shown below), but when running it, obviously there is error message:

Use of uninitialized value $G[0] in array element at dinucleotide.pl line 31.
Use of uninitialized value $G[0] in multiplication (*) at dinucleotide.pl line 32.
..................

Thank you very much!

#!/usr/bin/perl -w

##
## Display all dinucleotide frequencies along with their expected
## values under the independence hypothesis.
##

##  Hashes for translating between bases and numbers
%H = ("A", 0, "T", 1, "G", 2, "C", 3);
%HI = (0, "A", 1, "T", 2, "G", 3, "C");

## The data file comes from the command line.
$file = shift @ARGV;

## Read the data.
open(F, "$file") or die "Unable to open $file\n";
while (<F>)
{
    chomp;
    $K = length($_);
    foreach $j (0..$K-1) { push(@G, $H{substr($_, $j, 1)}); }
}
close(F);

## The number of genes.
$L = scalar(@G);

## Get the marginal and joint frequencies.
foreach $i (0..$L-2)
{
    ++$M[$G[$i]];
    ++$J[4 * $G[$i] + $G[$i + 1]];
}

## Normalize.
foreach $j (0..3) { $M[$j] /= ($L - 1); }
foreach $j (0..15) { $J[$j] /= ($L - 1); }

## Display the marginals.
foreach $j (0..3) { print sprintf("%s: %f\n", $HI{$j}, $M[$j]); }
print "\n";

## Display the dinucleotide frequencies.
foreach $j (0..3)
{
    foreach $k (0..3)
    {
    $P = $M[$j] * $M[$k];
    $Q = 4 * $j + $k;
    print sprintf("%s%s: %f\t%f\n", $HI{$j}, $HI{$k}, $J[$Q], $P);
    }
}

perl frequency • 14k views

ADD COMMENT • link updated 2.5 years ago by xiaoleiusc ▴ 140 • written 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

I should note that I do not need to calculate each sequence's dinucleotide frequency. I need to get the di-nt frequency from all sequences. Thank you for your attention.

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

How are you calling the script? Command line and input file format?

ADD REPLY • link 10.3 years ago by vodka ▴ 80

0

Entering edit mode

problem solved. thank you for comment.

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

I also get the same problem, how can i resolve it?

I used command

perl dinucleotide.pl -in utr.fasta > out

it gives an error :(

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by harpreetmanku04 ▴ 10

1

Entering edit mode

Hi, EMBOSS is a good tool. I wrote the following perl script for your reference.

#!/usr/bin/perl -w
use strict;

my ($genome, $head, $tail);
my (%mono_nt, %di_nt);

$/ = ">";
open my $fasta, '<', $ARGV[0] or die $!;
while (<$fasta>) {
    chomp; s/\r//g; s/^\s*|\s*$//;
    if (/.+?\n(.+)/s) {
        (my $seq = $1) =~ s/\n//g;
        $genome .= uc $seq;
        $head = uc substr($seq, 0, 1);
        $di_nt{"$tail$head"}-- if $tail;
        $tail = uc substr($seq, -1);
    }
}
close $fasta;

my $len = length $genome;
for my $i (0..$len-2) {
    my $each_mono_nt = substr($genome, $i, 1);
    my $each_di_nt   = substr($genome, $i, 2);
    $mono_nt{$each_mono_nt}++;
    $di_nt{$each_di_nt}++;
}
$mono_nt{$tail}++;

print "-"x30, "\nSingle nucleotide frequency:\n";
for my $nt (sort keys %mono_nt) {print "$nt\t", $mono_nt{$nt} / $len, "\n";}

print "\n", "-"x30, "\nDinucleotide frequency:\nDinucleotide\tObs. freq.\tExp. freq.\n";
for my $nt_pair (sort keys %di_nt) {
    my ($first_nt, $second_nt) = split //, $nt_pair;
    print "$nt_pair\t", $di_nt{$nt_pair} / ($len-1), "\t",
        $mono_nt{$first_nt} * $mono_nt{$second_nt} /$len /$len, "\n";
}

For example, a genome fasta file is as below.

>Chr1
ATGCTAG
>Chr2
aTGGcCA

You can run perl di_nt_freq.pl genome.fa and get the following.

------------------------------
Single nucleotide frequency:
A       0.285714285714286
C       0.214285714285714
G       0.285714285714286
T       0.214285714285714

------------------------------
Dinucleotide frequency:
Dinucleotide    Obs. freq.      Exp. freq.
AG      0.0769230769230769      0.0816326530612245
AT      0.153846153846154       0.0612244897959184
CA      0.0769230769230769      0.0612244897959184
CC      0.0769230769230769      0.0459183673469388
CT      0.0769230769230769      0.0459183673469388
GA      0       0.0816326530612245
GC      0.153846153846154       0.0612244897959184
GG      0.0769230769230769      0.0816326530612245
TA      0.0769230769230769      0.0612244897959184
TG      0.153846153846154       0.0612244897959184

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by biolab ★ 1.4k

0

Entering edit mode

I have to take only one which I will consider? expected or observed?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by harpreetmanku04 ▴ 10

1

Entering edit mode

Hi, harpreetmanku04, you can use the observed value. It's the real di-nucleotide frequency of your input genome.

ADD REPLY • link 8.5 years ago by biolab ★ 1.4k

0

Entering edit mode

ok thanku :)

ADD REPLY • link 8.5 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

By the way, does this support genome.fa.gz compressed file as input?

ADD REPLY • link 2.5 years ago by xiaoleiusc ▴ 140

Ram · Answer 1 · 2014-01-09

Just a suggestion, you can solve this (or test your script results) by using EMBOSS command line utility compseq:

$ compseq
Calculate the composition of unique words in sequences
Input sequence(s): test.fa
Word size to consider (e.g. 2=dimer) [2]: 
Program compseq output file [test.composition]: test.composition

the output:

$ less test.composition
#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#       FM3R2-A1
#       FM3R2-B1
# ... et al.


Word size       2
Total count     41749

#
# Word  Obs Count       Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
AA      3207            0.0768162       0.0625000       1.2290594
AC      2377            0.0569355       0.0625000       0.9109679
AG      3122            0.0747802       0.0625000       1.1964837
AT      2053            0.0491748       0.0625000       0.7867973
CA      1878            0.0449831       0.0625000       0.7197298
CC      1999            0.0478814       0.0625000       0.7661022
CG      2848            0.0682172       0.0625000       1.0914752
CT      2155            0.0516180       0.0625000       0.8258880
GA      3234            0.0774629       0.0625000       1.2394069
GC      2925            0.0700616       0.0625000       1.1209849
GG      4474            0.1071642       0.0625000       1.7146279
GT      2810            0.0673070       0.0625000       1.0769120
TA      2434            0.0583008       0.0625000       0.9328128
TC      1580            0.0378452       0.0625000       0.6055235
TG      3006            0.0720017       0.0625000       1.1520276
TT      1647            0.0394500       0.0625000       0.6312007

Other   0               0.0000000       0.0000000       10000000000.0000000

Ram · Answer 2 · 2014-01-09

2

Entering edit mode

10.3 years ago

Csaba Kerepesi ▴ 350

Hi!

Delete description line(s) of the sequence file. Input file should contain only characters A,T,G,C.

I have tried and the script works without warning messages:

~/perl$ perl dinucleotid-freq.perl ecolidarab
A: 0.237818
T: 0.241783
G: 0.269439
C: 0.250960

AA: 0.069548    0.056557
AT: 0.063013    0.057500
AG: 0.051856    0.064077
AC: 0.053400    0.059683
TA: 0.043319    0.057500
TT: 0.068591    0.058459
TG: 0.074130    0.065146
TC: 0.055743    0.060678
GA: 0.060401    0.064077
GT: 0.058343    0.065146
GG: 0.065288    0.072597
GC: 0.085408    0.067618
CA: 0.064550    0.059683
CT: 0.051837    0.060678
CG: 0.078165    0.067618
CC: 0.056409    0.062981

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 10.3 years ago by Csaba Kerepesi ▴ 350

0

Entering edit mode

why your results of dinucleotide frequency have two columns (e.g. AA: 0.069548 and 0.056557), which one is the frequency?

ADD REPLY • link 2.6 years ago by xiaoleiusc ▴ 140