How can I best plot a histogram for billions of genotype quality values?
I have a simple one-column file with billions of genotype quality values. The file is several GB uncompressed.
Is there a statistics library in Python or R that can build up a histogram by streaming through the data, instead of loading everything into memory and then creating the histogram? I would prefer to use all of the data rather than a sample.
Or do I have to write a script first to collect the counts per bin and then give those counts to R for plotting? This functionality feels like it should already exist in a stats library somewhere.
I know the min and max of the values and would be able to specify a bin size.
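To make the "collect the counts per bin" fallback concrete, here is a minimal sketch of what I have in mind, in plain Python. The function name `stream_histogram`, its parameters, and the file name in the usage note are only placeholders I made up for illustration, and it assumes one numeric value per line and fixed-width bins between the known min and max.

```python
import math

def stream_histogram(lines, lo, hi, bin_width):
    """Count values into fixed-width bins over [lo, hi], reading one line at a time.

    `lines` can be any iterable of strings, e.g. an open file handle,
    so only the bin counts are ever held in memory.
    """
    n_bins = max(1, math.ceil((hi - lo) / bin_width))
    counts = [0] * n_bins
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        idx = int((float(line) - lo) / bin_width)
        # Clamp so that a value equal to `hi` lands in the last bin.
        # Assumption: values outside [lo, hi] are also clamped to the edge bins.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts
```

I would call it roughly as `with open("gq_values.txt") as fh: counts = stream_histogram(fh, 0, 100, 1)` and then pass the resulting counts to a bar plot. A pure Python loop like this is probably slow for billions of lines, which is why I am hoping a library already does this in a streaming fashion.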