BioPerl: Running Out Of Memory Using Bio::PopGen::Population
12.3 years ago
Abdel ▴ 410

Hi all,

I am trying to store information on about 2,000 individuals from 2 populations, each individual having about 500,000 markers (SNPs). Unfortunately, Perl runs out of memory after storing only about 60 individuals...

First I store the ~500,000 SNPs for an individual in a Bio::PopGen::Individual object, calling the following in a for loop of ~500,000 iterations:

$ind->add_Genotype(
                   Bio::PopGen::Genotype->new(-alleles     => ["$A1", "$A2"],
                                              -marker_name => "$rsid{$i}")
                  );

Then I store the individual in a population with Bio::PopGen::Population as follows:

$population{$pop}->add_Individual($ind);

I do this in a while loop over each line of my input file (about 2,000 lines, each containing the individual ID, population membership, and two alleles for each of the 500,000 SNPs). Unfortunately, as I said, I get the message "Out of memory!" after processing only about 60 individuals, despite using undef to free all arrays/hashes I no longer need after the for and while loops.
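To make it concrete, here is a condensed sketch of the loop (the file name, the exact column layout, and how %rsid is filled are simplified placeholders for what my real script does):

use strict;
use warnings;
use Bio::PopGen::Individual;
use Bio::PopGen::Genotype;
use Bio::PopGen::Population;

my %rsid;        # marker names keyed by SNP index, loaded beforehand
my %population;  # one Bio::PopGen::Population object per population label
my $n_snps = scalar keys %rsid;

open my $fh, '<', 'genotypes.txt' or die "Cannot open input: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # assumed layout: individual ID, population, then two alleles per SNP
    my ( $id, $pop, @alleles ) = split /\s+/, $line;

    my $ind = Bio::PopGen::Individual->new( -unique_id => $id );
    for my $i ( 0 .. $n_snps - 1 ) {
        my ( $A1, $A2 ) = @alleles[ 2 * $i, 2 * $i + 1 ];
        $ind->add_Genotype(
            Bio::PopGen::Genotype->new( -alleles     => [ $A1, $A2 ],
                                        -marker_name => $rsid{$i} )
        );
    }

    $population{$pop} = Bio::PopGen::Population->new( -name => $pop )
        unless exists $population{$pop};
    $population{$pop}->add_Individual($ind);    # memory keeps growing here
}
close $fh;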

The reason I am doing this is that I want to calculate FST with Bio::PopGen::PopStats over all individuals and all markers. I'm sure there is a more efficient way of doing this that doesn't use so much memory... Does anyone have any suggestions? Many thanks!
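For reference, the call I am aiming for is roughly the following (just a sketch; I am assuming the Fst method of Bio::PopGen::PopStats takes an arrayref of population objects and an arrayref of marker names):

use Bio::PopGen::PopStats;

# intended genome-wide FST over both populations and all ~500,000 markers
my $stats   = Bio::PopGen::PopStats->new;
my @markers = map { $rsid{$_} } sort { $a <=> $b } keys %rsid;
my $fst     = $stats->Fst( [ values %population ], \@markers );
print "Fst = $fst\n";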

perl bioperl memory

Storing that many loci in Perl becomes expensive. I have done it. The course of action really depends on what format your data is in. If you can reformat your data, there are many programs that can calculate FST for you.

Another option is to calculate FST per locus, which does not require much memory. You can then use this distribution of per-locus FSTs to calculate an overall FST.
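For example, something along these lines should keep memory roughly constant (an untested sketch; it assumes a transposed input with one SNP per line, and population_of_column() is a hypothetical stand-in for however you map allele columns to populations):

use strict;
use warnings;
use Bio::PopGen::Individual;
use Bio::PopGen::Genotype;
use Bio::PopGen::Population;
use Bio::PopGen::PopStats;

my $stats = Bio::PopGen::PopStats->new;
my @per_locus_fst;

# assumed transposed layout: rsid, then two alleles per individual
open my $fh, '<', 'genotypes_by_snp.txt' or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $rsid, @alleles ) = split /\s+/, $line;

    # build two throw-away, single-marker populations for this SNP only
    my %pop = map { $_ => Bio::PopGen::Population->new( -name => $_ ) } qw(pop1 pop2);
    my $n_ind = @alleles / 2;
    for my $i ( 0 .. $n_ind - 1 ) {
        my $ind = Bio::PopGen::Individual->new( -unique_id => "ind$i" );
        $ind->add_Genotype(
            Bio::PopGen::Genotype->new( -alleles     => [ @alleles[ 2 * $i, 2 * $i + 1 ] ],
                                        -marker_name => $rsid )
        );
        $pop{ population_of_column($i) }->add_Individual($ind);    # hypothetical lookup
    }

    push @per_locus_fst, $stats->Fst( [ values %pop ], [$rsid] );
    # %pop goes out of scope here, so memory stays bounded per SNP
}
close $fh;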

Thanks, that's a good suggestion. If I transposed my input file to have a SNP on each row instead of an individual on each row, that might work. Would the overall FST then simply be the mean of all 500,000 per-SNP FSTs, or is it more complicated than that?

The mean is the point estimate of the overall FST, and you can also calculate its error. You could also use FDIST2, which has some cool functionality.
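Roughly like this, assuming @per_locus_fst already holds one value per SNP (just a sketch):

use List::Util qw(sum);

# point estimate and standard error from the per-locus FST distribution
my $n    = scalar @per_locus_fst;
my $mean = sum(@per_locus_fst) / $n;
my $var  = sum( map { ( $_ - $mean )**2 } @per_locus_fst ) / ( $n - 1 );
my $se   = sqrt( $var / $n );
printf "FST = %.4f +/- %.4f (standard error over %d loci)\n", $mean, $se, $n;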

There are a ton of FST estimators; which one are you using?
