How to plot a histogram for billions of genotype quality values?
5.6 years ago
William ★ 4.9k

How can I best plot a histogram for billions of genotype quality values?

I have a simple one column file with billions of genotype quality values. The file is several GB uncompressed.

Is there a statistics library in Python or R that can build up a histogram by streaming through the data, instead of loading everything into memory and then creating the histogram? I prefer using all of the data versus sampling it.

Or do I have to write a script first to collect the counts per bin and then give those counts per bin to R for plotting? This functionality feels like it should already exist in a stats library somewhere.

I know the min and max of the values and would be able to specify a bin size.

vcf quality R python • 1.9k views

What kind of graph do you need? qual = f(pos)?


I've always been intrigued by the potential of using a data log visualizer like RRD (see http://oss.oetiker.ch/rrdtool/index.en.html) to visualize genomic data, with the genome coordinate taking the place of time.


"I prefer using all of the data versus sampling it"

Could you motivate this? If the goal is a histogram that shows the distribution of quality values, I don't see why you need billions of data points, especially since the range of quality values is discrete and not very large. Once you have collected a million or so data points at random, you have a pretty good estimate of the whole thing.
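For illustration, one way to draw such a random subset in a single pass, without knowing the number of lines in advance, is reservoir sampling. This is only a sketch: the function name `reservoir_sample`, the sample size and the stand-in `range` input are assumptions, not something from the thread.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Stand-in for values streamed from the real file (hypothetical input).
values = range(1_000_000)
subset = reservoir_sample(values, k=1000, seed=42)
print(len(subset))
```

With a real file, the open file handle could be passed as `stream`, since iterating over it yields one line at a time.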


Sampling also adds complexity and the risk of doing it wrong. At the very least I would need to think about how to do it correctly. As it stands, the frequency of the analysis and the size of the data still allow for processing the whole collection.

5.6 years ago

"Or do I have to write a script first to collect the counts per bin and then give those counts per bin to R for plotting? This functionality feels like it should already exist in a stats library somewhere."

This is what I'd do. You're right that it might exist, but writing a 10-line Perl script to bin them might be quicker than searching. I'd wager you'd be done before someone chimes in with the library to do it :)
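A rough sketch of such a binning script, written in Python here rather than Perl since the question mentions Python. It reads one value per line and keeps only the per-bin counts in memory. The 0 to 99 value range, the bin width of 10, the function name `stream_histogram` and the file name `gq_values.txt` are assumptions for illustration; substitute the known min, max and bin size.

```python
import io
from collections import Counter


def stream_histogram(lines, lo, hi, bin_width):
    """Count values per bin while reading one line at a time."""
    n_bins = int((hi - lo) // bin_width) + 1
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = float(line)
        idx = int((value - lo) // bin_width)
        # Clamp out-of-range values into the first or last bin.
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    # Return (bin_start, count) pairs, including empty bins.
    return [(lo + i * bin_width, counts.get(i, 0)) for i in range(n_bins)]


# Tiny in-memory stand-in for the real file. For real data, something like
# `with open("gq_values.txt") as fh: stream_histogram(fh, 0, 99, 10)`
# (hypothetical file name) would stream from disk instead.
demo = io.StringIO("0\n5\n12\n99\n47\n\n50\n")
hist = stream_histogram(demo, lo=0, hi=99, bin_width=10)
for start, count in hist:
    print(f"{start}\t{count}")
```

The tab-separated bin starts and counts printed at the end are small enough to hand to R, or any plotting tool, as a bar chart.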

5.6 years ago
biocyberman ▴ 830

This is exactly the selling point of Datashader: http://datashader.readthedocs.org/en/latest (even though I tend to filter and subset the data whenever possible).
