BioPerl: Running Out Of Memory Using Bio::PopGen::Population
12.3 years ago
Abdel ▴ 410

Hi all,

I am trying to store information on about 2,000 individuals from 2 populations, each individual having about 500,000 markers (SNPs). Unfortunately, Perl runs out of memory after storing only about 60 individuals...

First I store the ~500,000 SNPs for an individual in a Bio::PopGen::Individual object, calling the following in a for loop of ~500,000 iterations:

$ind->add_Genotype(
                   Bio::PopGen::Genotype->new(-alleles     => ["$A1", "$A2"],
                                              -marker_name => "$rsid{$i}")
                  );

Then I store the individual in a population with Bio::PopGen::Population as follows:

$population{$pop}->add_Individual($ind);

I do this in a while loop over each line of my input file (about 2,000 lines, each containing the individual ID, population membership, and two alleles for each of the 500,000 SNPs). Unfortunately, as I said, I get the message "Out of memory!" after processing only about 60 individuals, despite using undef to free all arrays/hashes I no longer need after the for and while loops.
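To make it concrete, here is a condensed sketch of the loop (the file name, the exact column layout, and how %rsid is filled are simplified placeholders for what my real script does):

use strict;
use warnings;
use Bio::PopGen::Individual;
use Bio::PopGen::Genotype;
use Bio::PopGen::Population;

my %rsid;        # marker names keyed by SNP index, loaded beforehand
my %population;  # one Bio::PopGen::Population object per population label
my $n_snps = scalar keys %rsid;

open my $fh, '<', 'genotypes.txt' or die "Cannot open input: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # assumed layout: individual ID, population, then two alleles per SNP
    my ( $id, $pop, @alleles ) = split /\s+/, $line;

    my $ind = Bio::PopGen::Individual->new( -unique_id => $id );
    for my $i ( 0 .. $n_snps - 1 ) {
        my ( $A1, $A2 ) = @alleles[ 2 * $i, 2 * $i + 1 ];
        $ind->add_Genotype(
            Bio::PopGen::Genotype->new( -alleles     => [ $A1, $A2 ],
                                        -marker_name => $rsid{$i} )
        );
    }

    $population{$pop} = Bio::PopGen::Population->new( -name => $pop )
        unless exists $population{$pop};
    $population{$pop}->add_Individual($ind);    # memory keeps growing here
}
close $fh;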

The reason I am doing this is that I want to calculate FST with Bio::PopGen::PopStats over all individuals and all markers. I'm sure there is a more efficient way of doing this that doesn't use so much memory... Does anyone have any suggestions? Many thanks!
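For reference, the call I am aiming for is roughly the following (just a sketch; I am assuming the Fst method of Bio::PopGen::PopStats takes an arrayref of population objects and an arrayref of marker names):

use Bio::PopGen::PopStats;

# intended genome-wide FST over both populations and all ~500,000 markers
my $stats   = Bio::PopGen::PopStats->new;
my @markers = map { $rsid{$_} } sort { $a <=> $b } keys %rsid;
my $fst     = $stats->Fst( [ values %population ], \@markers );
print "Fst = $fst\n";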

perl bioperl memory

Storing that many loci in Perl becomes expensive. I have done it. The course of action really depends on what format your data is in. If you can reformat your data, there are many programs that can calculate FST for you.

Another option is to calculate FST per locus, which does not require much memory. You can then use this distribution of per-locus FSTs to calculate an overall FST.
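For example, something along these lines should keep memory roughly constant (an untested sketch; it assumes a transposed input with one SNP per line, and population_of_column() is a hypothetical stand-in for however you map allele columns to populations):

use strict;
use warnings;
use Bio::PopGen::Individual;
use Bio::PopGen::Genotype;
use Bio::PopGen::Population;
use Bio::PopGen::PopStats;

my $stats = Bio::PopGen::PopStats->new;
my @per_locus_fst;

# assumed transposed layout: rsid, then two alleles per individual
open my $fh, '<', 'genotypes_by_snp.txt' or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $rsid, @alleles ) = split /\s+/, $line;

    # build two throw-away, single-marker populations for this SNP only
    my %pop = map { $_ => Bio::PopGen::Population->new( -name => $_ ) } qw(pop1 pop2);
    my $n_ind = @alleles / 2;
    for my $i ( 0 .. $n_ind - 1 ) {
        my $ind = Bio::PopGen::Individual->new( -unique_id => "ind$i" );
        $ind->add_Genotype(
            Bio::PopGen::Genotype->new( -alleles     => [ @alleles[ 2 * $i, 2 * $i + 1 ] ],
                                        -marker_name => $rsid )
        );
        $pop{ population_of_column($i) }->add_Individual($ind);    # hypothetical lookup
    }

    push @per_locus_fst, $stats->Fst( [ values %pop ], [$rsid] );
    # %pop goes out of scope here, so memory stays bounded per SNP
}
close $fh;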

Thanks, that's a good suggestion. If I transposed my input file to have a SNP on each row instead of an individual on each row, that might work. Would the overall FST then simply be the mean of all 500,000 per-SNP FSTs, or is it more complicated than that?

The mean is the point estimate of the overall FST, and you can also calculate its error. You could also use FDIST2, which has some cool functionality.
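Roughly like this, assuming @per_locus_fst already holds one value per SNP (just a sketch):

use List::Util qw(sum);

# point estimate and standard error from the per-locus FST distribution
my $n    = scalar @per_locus_fst;
my $mean = sum(@per_locus_fst) / $n;
my $var  = sum( map { ( $_ - $mean )**2 } @per_locus_fst ) / ( $n - 1 );
my $se   = sqrt( $var / $n );
printf "FST = %.4f +/- %.4f (standard error over %d loci)\n", $mean, $se, $n;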

There are a ton of FST estimators; which one are you using?
