Question: Perl: Shuffle An Array 10 Times, Computing 10 Average Maxes, Printing The Mean Max Average. Repeat This Entire Process 1000 Times.
0
6.8 years ago by
Neal40
Norway
Neal40 wrote:

Hello all,

This is my first post here, but I will try to explain the programming problem as best as I can.

I have a data set which looks like the following

``````
NR_046018    DDX11L1    ,    0    0    1    1    1    1    1    1    1    1    0    0    0    0    1.44    2.72    3.84    4.92
NR_047520    LOC643837    ,    3    2.2    0.2    0    0    0.28    1    1    1    1    2.2    4.8    5    5.32    5    5    5    5    3
NM_001005484    OR4F5    ,    2    2    2    1.68    1    0.48    0    0.92    1    1.8    2    2    2    2.04    3.88    3
NR_028327    LOC100133331    ,    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
``````

What is needed

1. Shuffle the array 10 times. After _each_ shuffle, divide the array into 2 new arrays, say _set1_ and _set2_.

2. For each new array, find the maximum of each row of numbers and compute the average of those maxima (the "average max").

3. This gives 10 average maxes for each of _set1_ and _set2_. Compute the average of the 10 average maxes obtained for _each_ set; let's call these _10avg1_ and _10avg2_.

4. Repeat the whole process 1000 times to get a list of 1000 _10avg1_ and 1000 _10avg2_.

Code

``````
use strict;
use warnings;
use List::Util qw(shuffle max);

my $file = 'mergesmall.txt';

open my $fh, '<', $file or die "Unable to open $file: $!";
open my $out, '>', 'Shuffle.out' or die "Unable to open Shuffle.out: $!";

my @arr = <$fh>;

my $i = 10;
while ($i) {
    my @arr1 = ();  # Initialize 1st set
    my @arr2 = ();  # Initialize 2nd set

    my @shuffled = shuffle(@arr);

    push @arr1, @shuffled[0..1];  # First two shuffled rows go into the 1st set
    push @arr2, @shuffled[2..3];  # Next two shuffled rows go into the 2nd set

    # Reset the running totals on every shuffle so the averages do not accumulate
    my ($total1, $num1) = (0, 0);
    foreach (@arr1) {
        my @val1 = split;
        my $max1 = max(@val1[3..$#val1]);  # Fields 0-2 are the two IDs and the comma

        $total1 += $max1;
        $num1++;
    }

    my $average_max1 = $total1 / $num1;
    print $out "Average max 1st set is : ", $average_max1;

    my ($total2, $num2) = (0, 0);
    foreach (@arr2) {
        my @val2 = split;
        my $max2 = max(@val2[3..$#val2]);

        $total2 += $max2;
        $num2++;
    }

    my $average_max2 = $total2 / $num2;
    print $out "\n", "Average max 2nd set is : ", $average_max2, "\n\n";

    $i--;
}
``````

The Problem

The code I have been able to write so far can get 10 average maxes for each of _set1_ and _set2_. I am not able to figure out how to compute the average of these 10 average maxes. If I can figure this out, I can easily add a `for` loop to run the whole thing 1000 times and obtain 1000 _10avg1_ and 1000 _10avg2_.
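One way to get the mean of the 10 average maxes is to push each shuffle's value onto an array and average that array once the loop has finished. The following is a minimal sketch, assuming the four sample rows from the question and two rows per set; the helper names `mean` and `avg_max` are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use List::Util qw(shuffle max sum);

# Hypothetical helper: arithmetic mean of a list (0 for an empty list)
sub mean { return @_ ? sum(@_) / @_ : 0 }

# Hypothetical helper: average of the per-row maxima.
# Fields 0-2 of each row are the two IDs and the comma.
sub avg_max {
    my @maxima = map { my @f = split; max(@f[3 .. $#f]) } @_;
    return mean(@maxima);
}

# The four sample rows from the question (whitespace-separated)
my @arr = (
    "NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92",
    "NR_047520 LOC643837 , 3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3",
    "NM_001005484 OR4F5 , 2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3",
    "NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
);

my (@avg1, @avg2);   # one average max per shuffle, for each set
for my $shuffle (1 .. 10) {
    my @shuffled = shuffle(@arr);
    push @avg1, avg_max(@shuffled[0 .. 1]);   # set1: first two shuffled rows
    push @avg2, avg_max(@shuffled[2 .. 3]);   # set2: remaining two rows
}

# Mean of the 10 average maxes for each set
my $mean_avg1 = mean(@avg1);
my $mean_avg2 = mean(@avg2);
printf "10avg1 = %.4f, 10avg2 = %.4f\n", $mean_avg1, $mean_avg2;
```

The scalars are called `$mean_avg1` and `$mean_avg2` because a Perl identifier cannot start with a digit, so `$10avg1` would not compile.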

Points to Note

1. In the actual data set each row has at most 400 numbers; some rows have fewer and some have none at all, but never more than 400.

2. The actual data set has 41,382 rows. Set1 will comprise 23,558 rows and set2 will comprise 17,824 rows.

3. The file is a .txt file and all the numbers in each row are tab delimited.
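Since only the maximum of each row is ever used, one option is to reduce every row to its maximum once, before any shuffling, so that each shuffle moves plain numbers around instead of long strings. The sketch below assumes that rows without any numbers should simply be skipped (the question does not say how they should be treated). `row_max` is a hypothetical helper, and the third row in the example is invented to show the empty case.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: largest value in a row, or undef if the row has
# no numbers after the comma field.
sub row_max {
    my ($line) = @_;
    my @fields = split ' ', $line;
    my @values = @fields[3 .. $#fields];   # skip the two IDs and the comma
    return @values ? max(@values) : undef;
}

my @rows = (
    "NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92",
    "NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
    "XX_000000 MADE_UP_ID ,",   # invented row with no values
);

# Keep only rows that actually have a maximum
my @maxima = grep { defined } map { row_max($_) } @rows;
print "@maxima\n";   # prints "4.92 0"
```

With the real file, the same `map` could be applied to the lines read from the file handle, and the resulting list of maxima shuffled instead of the raw rows.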

perl bioinformatics array • 3.6k views
modified 6.8 years ago by SES8.2k • written 6.8 years ago by Neal40

Could you please explain what the application to bioinformatics is? It looks like you are doing some resampling here? Did I get it right that you want to compute the maximum of the averages, not the maximum and the averages?

@Michael Hello Michael! Thank you for your comment. This data is a small part of ChIP-Seq data for the K562 cell line which I've been given to analyze. Yes, we are doing some resampling here; we are actually trying to generate a control set. And thank you for asking for a clarification, I think I should reframe the question. I need to compute the average maximum over all rows. So for example, I find the maximum value in NR_046018, which is 4.92 here, similarly for NR_047520 (5.32), and so on for all the rows (23,558 in set1 and 17,824 in set2). Once these maximum values are found, I need to find the average maximum.
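For the four sample rows, the calculation described here works out as follows. This is only a small sketch; the maxima are read off the sample data in the question.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Row maxima for NR_046018, NR_047520, NM_001005484 and NR_028327
my @row_max = (4.92, 5.32, 3.88, 0);

# Average maximum = sum of the row maxima divided by the number of rows
my $average_max = sum(@row_max) / @row_max;
printf "%.2f\n", $average_max;   # prints 3.53
```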

And since we are trying to generate a control set, I have to shuffle the main data (the one which has 41,382 rows; this main data set was generated by combining two pre-existing data sets 1 and 2). For each shuffle, we divide the new shuffled array into 2 new arrays and compute the average maximum for each of those new sets. We shuffle 10 times, obtaining 10 average maximums for set 1 and similarly for set 2. (I have been able to do it this far.) From these 10 average maximums, I need to find the mean average. And then this process of 10 shufflings needs to be repeated 1000 times, so I have 1000 mean averages. I hope I was able to explain myself a little better...

3
6.8 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

This question is a bit tricky for several reasons.

1. The data is oddly formatted so you need to do more than just slurp the whole file into an array (this is almost always the wrong thing to do).
2. There are an unequal number of values for each line. That means you will need to think about how to select/sort values equally per line/measurement.
3. It is not clear to me why you are using this selection algorithm, so I won't attempt to code that part.

Concerning 1), you probably just want those values right? Slurping the whole file and shuffling is likely not doing what you want. Here is one way to get the values:

``````
#!/usr/bin/env perl

use v5.10; # make sure we have at least Perl 5.10 so we can use the feature 'say'
use strict;
use warnings;
use Data::Dump;
use Text::CSV;
use List::Util qw(sum max shuffle);

# The fields are tab-delimited (see the question), so the columns under
# __DATA__ must be separated by real tabs for this to parse.
my $csv = Text::CSV->new({sep_char => "\t"});

my @cols;

while (<DATA>) {
    chomp;
    my ($ids, $values) = split(/,\t/, $_);
    if ($csv->parse($values)) {
        @cols = $csv->fields();
        say "Data:          ",join(" ",@cols);
        say "Shuffled data: ",join(" ",shuffle(@cols));
        # Here you can select/sort the shuffled data how you like
        say "Mean:          ",mean(@cols);
        say "Max:           ",max(@cols);
        say "";
    }
    else {
        my $err = $csv->error_input;
        say "Failed to parse line: $err";
    }
}

sub mean { return @_ ? sum(@_) / @_ : 0 }
__DATA__
NR_046018    DDX11L1    ,    0    0    1    1    1    1    1    1    1    1    0    0    0    0    1.44    2.72    3.84    4.92
NR_047520    LOC643837    ,    3    2.2    0.2    0    0    0.28    1    1    1    1    2.2    4.8    5    5.32    5    5    5    5    3
NM_001005484    OR4F5    ,    2    2    2    1.68    1    0.48    0    0.92    1    1.8    2    2    2    2.04    3.88    3
NR_028327    LOC100133331    ,    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
``````

If we call this "biostars_59584.pl" then:

``````
$ perl biostars_59584.pl
Data:          0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
Shuffled data: 1 2.72 1 1 0 1.44 4.92 1 0 1 0 1 1 0 0 3.84 0 1
Mean:          1.16222222222222
Max:           4.92

Data:          3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3
Shuffled data: 5 1 1 0.2 2.2 3 1 5.32 5 3 5 1 4.8 5 0 0.28 0 5 2.2
Mean:          2.63157894736842
Max:           5.32

Data:          2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3
Shuffled data: 2.04 3 0 1 2 2 0.92 2 1 2 0.48 2 2 3.88 1.8 1.68
Mean:          1.7375
Max:           3.88

Data:          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Shuffled data: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mean:          0
Max:           0
``````

Generating shuffled samples of those values is easy now if you edit a couple of lines:

``````
while (<DATA>) {
    chomp;
    my ($ids, $values) = split(/,\t/, $_);
    if ($csv->parse($values)) {
        @cols = $csv->fields();
        say "Data:                ",join(" ",@cols);
        for (my $i = 0; $i < 10; $i++ ) { # Use a C-style for loop to generate 10 samples
            say "Shuffled data set $i: ",join(" ",shuffle(@cols));
            # Here you can select/sort the shuffled data how you like
        }
        say "";
    }
    else {
        my $err = $csv->error_input;
        say "Failed to parse line: $err";
    }
}
``````

Looking at the first set, you can see the shuffled data:

``````
$ perl biostars_59584.pl | head -11
Data:                0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
Shuffled data set 0: 0 1 1 1 0 0 0 1 2.72 4.92 0 1 1 0 1 3.84 1.44 1
Shuffled data set 1: 1.44 1 4.92 1 1 1 1 0 1 2.72 0 1 3.84 1 0 0 0 0
Shuffled data set 2: 0 4.92 0 1 1 3.84 1 1 1 1 1 2.72 1 0 0 1.44 0 0
Shuffled data set 3: 1 1 0 0 1 3.84 1.44 1 1 2.72 0 1 1 4.92 0 0 1 0
Shuffled data set 4: 4.92 1 1 1.44 0 1 1 0 3.84 2.72 0 0 1 1 0 1 1 0
Shuffled data set 5: 1 0 1 1 4.92 0 0 0 0 1 1.44 1 0 1 3.84 2.72 1 1
Shuffled data set 6: 4.92 3.84 0 1 1 1 1 0 0 1.44 0 1 1 0 2.72 1 0 1
Shuffled data set 7: 1 1 0 0 1 1 1 0 2.72 0 1.44 1 1 1 4.92 3.84 0 0
Shuffled data set 8: 4.92 0 1.44 1 1 0 1 1 1 0 1 1 0 1 0 3.84 0 2.72
Shuffled data set 9: 1 4.92 1 1 0 1 0 1 0 2.72 0 1 1.44 0 1 1 3.84 0
``````

From there, you just need to figure out the appropriate way to sample; then you can calculate the stats and store them (in a hash, for example). About 2) above, I'm not sure it makes sense to slice the first few elements off an array and "resample" those as you are trying to do. Note that there are many ways of sampling an array, so try to figure out the method you want and then search for a package on CPAN. For example, there are already efficient methods for resampling means, and there may also be more efficient ways of shuffling and selecting the elements from an array. If you are confident about the last point (3), then proceed (it should be pretty easy); otherwise, find a more appropriate algorithm and Perl package.

EDIT: Added a line to test if the code will work with older Perl versions. Note that the last link about sorting by index instead of value is something to keep in mind but I don't think it will really make a difference here. I would just use List::Util::shuffle for simplicity.

@SES Hello SES! I'm sorry for replying late, as I was not in my lab the past 2 days. Many thanks for going through my question and attempting to resolve it. However, I think I need to explain myself a bit more. The individual values in the rows need not be shuffled. It is the `rows per se` which need to be shuffled. Taking the example of the small data set I've written, it has 4 rows (compared to 41,382 in the actual file). These 4 rows are shuffled once. This shuffled array is divided into 2 arrays, `@arr1` and `@arr2`, each of which contains 2 rows here for simplicity. Now I find the maximum of row1 and row2 in `@arr1`. Then the average max is computed in `$average_max1`. The same thing is repeated for `@arr2`, to obtain `$average_max2`. These are the results we obtain after the 1st shuffle. Next, the main array `@arr` is shuffled again, and this time different rows are pushed into `@arr1` and `@arr2`, thereby giving different `$average_max1` and `$average_max2`. The code I've written does a fine job till this point.

After this is where I am getting stuck. 10 shufflings of the main array `@arr` give 10 values each of `$average_max1` and `$average_max2`. I need a way to find the average of the 10 `$average_max1` values, let's call it `$superaverage1`; similarly, the average of the 10 `$average_max2` values needs to be found, let's call it `$superaverage2`. Ultimately, I need to obtain 1000 values of `$superaverage1` and 1000 values of `$superaverage2`. I hope I was able to explain myself a bit better now...
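The procedure as described here can be sketched as two nested loops: the inner one runs the 10 shuffles and averages the 10 average maxes, and the outer one repeats that 1000 times. This is only a sketch under stated assumptions: it works on a list of precomputed row maxima (here those of the four sample rows: 4.92, 5.32, 3.88 and 0), it takes the first `$n1` shuffled rows as set 1 and the rest as set 2, and the sub name `superaverages` is invented for illustration.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

sub mean { return @_ ? sum(@_) / @_ : 0 }

# Hypothetical helper: shuffle the row maxima $n_shuffles times, split each
# shuffle into set 1 (first $n1 rows) and set 2 (the rest), and return the
# mean of the per-shuffle average maxes for each set.
sub superaverages {
    my ($maxima, $n1, $n_shuffles) = @_;
    my (@avg1, @avg2);
    for (1 .. $n_shuffles) {
        my @shuffled = shuffle(@$maxima);
        push @avg1, mean(@shuffled[0 .. $n1 - 1]);
        push @avg2, mean(@shuffled[$n1 .. $#shuffled]);
    }
    return (mean(@avg1), mean(@avg2));
}

my @maxima = (4.92, 5.32, 3.88, 0);   # row maxima of the four sample rows

# Repeat the 10-shuffle procedure 1000 times, keeping one value per set each time
my (@superaverage1, @superaverage2);
for (1 .. 1000) {
    my ($s1, $s2) = superaverages(\@maxima, 2, 10);
    push @superaverage1, $s1;
    push @superaverage2, $s2;
}

printf "%d values per set, e.g. %.3f and %.3f\n",
    scalar @superaverage1, $superaverage1[0], $superaverage2[0];
```

For the real data, `$n1` would be 23,558 (the size of set1), and `@maxima` would hold one maximum per row of the 41,382-row file.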