R Plotting Line Graph With Large Dataset
3
4
9.3 years ago
kajendiran56 ▴ 120

Dear all, thank you for your time. I have a very large dataset (a table of 17 million x/y pairs). I am trying to find a way to represent it as a graph, but with this many data points I am unsure of the best strategy. R does not complain when I try to plot a line graph; however, it fails to show any graph. I believe the code is correct, as I have tested it with a smaller dataset. I also checked RAM usage: it goes up to around 8 GB, but there is still plenty free. If there is an error, why does R not complain?

The best answer I can come up with is to sample the data, but is this the best, or the only, option?

Thank you

Dear all, many thanks for all your suggestions; I will try to answer as best I can. The reason for such a large dataset is that I am using the entire UniRef100 database of protein sequences, which contains:

    kajendiran@serenity:~/Documents$ grep UniRef100_ -c uniref100.fasta
    17,598,871


I am using the sequences to calculate the Mw (molecular weight) in kDa (kilodaltons) and another value for each sequence, and then plot these two against each other to represent the entire database. I fully realise that with that many data points, individual points become difficult to distinguish. I just wanted to see whether it was possible in R, and what the result would look like if I sorted one axis in ascending order of the other and drew a line graph. The way to produce a legible graph would be, as someone suggested, to segregate the data into intervals along one axis, or to use other methods of which I was not aware.
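The segregate-and-summarise idea can be sketched roughly as follows. This is only an illustration on simulated data: the names `mw` and `value`, the log-normal stand-in distribution and the choice of 200 bins are assumptions, not the real UniRef100 values.

```r
# Simulated stand-in for the real table (hypothetical columns mw and value)
set.seed(1)
n     <- 1e5
mw    <- rlnorm(n, meanlog = 3.5, sdlog = 0.7)   # fake molecular weights in kDa
value <- 5 + 0.02 * mw + rnorm(n)                # fake second per-sequence value

# Split the Mw axis into 200 equal-count bins using quantile breaks
breaks <- unique(quantile(mw, probs = seq(0, 1, length.out = 201)))
bin    <- cut(mw, breaks = breaks, include.lowest = TRUE)

# One summary point per bin: median Mw and median of the other value
bin_mw    <- as.numeric(tapply(mw, bin, median))
bin_value <- as.numeric(tapply(value, bin, median))

# 200 points instead of n: a line graph that stays legible
plot(bin_mw, bin_value, type = "l",
     xlab = "Mw (kDa)", ylab = "median value per bin")
```

Medians are used here because they are less sensitive to a few extreme sequences than means; any other per-bin summary could be swapped in.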

I do not have much experience with R, thus I was curious to understand why R does not complain when I try to plot such a large amount of data but produces an empty graph. It appears it is purely the size of the dataset that is the issue, as it works perfectly with smaller, yet significant datasets. I also wanted to learn from your experience the best strategy to produce a graph in this scenario.

I agree that this could have been posted on stackoverflow; however, as someone states, it is certainly common in bioinformatics to have to deal with obscene amounts of data in certain scenarios. This is what makes this field so exciting and yet at times also painful for CPUs, RAM chips, our patience and our minds. The other reason that I posted this question here is that I have asked a few questions this week pertaining to different facets of a larger challenge that I am trying to overcome during an internship.

Anyway, I wanted to thank you once again for your time; you have all given me a lot to think about.

For anyone interested, here is the graph I produced by segregating and ordering the Mw values according to the y axis:

http://www.freeimagehosting.net/zxvey

r plot dataset • 16k views
3

Could you provide us the code snippet and sample input file? Also if you can provide us with any error you are getting, that will be great.

3

It's nearly impossible to tell what the problem might be without having a look at your code.

3

Please indicate the relevance of this question to a bioinformatics research problem. Otherwise, it's a basic R usage question better suited to stackoverflow.com.

2

People doing bioinformatics research often have to examine very large data sets, thus questions on the examination of very large data sets are relevant questions to ask of people doing bioinformatics research. Of course the question would be much tastier if the poster had indicated what, exactly, they were trying to examine with 17 million data points. Perhaps people asking such questions could be asked to add a little biology to their posts, as opposed to sending them elsewhere.

1

Asking for the biological relevance is precisely what I did. And no-one is being "sent" anywhere; merely pointing out that when questions are phrased in terms of R usage, they belong in R usage forums.

0

I have a different opinion. Though phrased in terms of R usage, I think the question is completely appropriate here. Language usage questions + Context = relevance. The context of this forum is bioinformatics. The context provides unique value not supplied elsewhere. A language usage question asked here can be enlightening in a way that if asked in a language specific forum would be completely opaque (to me).

0

I would add that I see many questions here being asked without context of the biological problem. These end up with answers where people try to guess what has really been asked and answer based on that. Taking the time to write a question is important. Taking the time to add biological information when this is asked in the comments is also a matter of respect for the time people spend answering questions. Failing to do so is disrespectful.

0

I have edited the original question and added the additional information given by the poster in an answer. I think, as it stands now, the relation to biology is becoming more clear. In this case, due to the additional information given, it turned out the question could be answered in a much better way. That, I am certain, is beneficial for the whole site and the OP.

0

Perfect! And thanks for the edit.

13
9.3 years ago

It doesn't make much sense to plot 17 million individual data points, because you will just get a black blob.

Think about the number of pixels in the image of your plot. If you draw it at 1440x900 pixels (about 1.3 Mpixel), you will have more than 10 points per pixel on average (about 13) if your points are distributed evenly. With so many data points, you have to convert your points into some kind of density and plot that. If it were genome-related data, you could compute the average number of points in a window of, e.g., 100 bp and plot this as a curve or heat map along the chromosome. It is hard to tell what would be a good way to represent your data because of the lack of detail in your question.
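As a rough illustration of turning points into a density, here is a minimal sketch that counts points per cell on a pixel-like grid and draws the counts as a heat map. The data are simulated, and the 300 x 200 grid and the log scaling are assumptions chosen for the example, not recommendations for your data.

```r
set.seed(42)
n <- 1e6                      # stand-in size; raise it as memory allows
x <- rnorm(n)
y <- rnorm(n)

nx <- 300; ny <- 200          # grid resolution, roughly "pixels"

# Assign each point to a grid cell along each axis (integer bin codes)
ix <- cut(x, breaks = seq(min(x), max(x), length.out = nx + 1),
          include.lowest = TRUE, labels = FALSE)
iy <- cut(y, breaks = seq(min(y), max(y), length.out = ny + 1),
          include.lowest = TRUE, labels = FALSE)

# Count points per cell via a combined cell index (column-major order)
counts <- matrix(tabulate((iy - 1L) * nx + ix, nbins = nx * ny),
                 nrow = nx, ncol = ny)

# log1p keeps sparse regions visible next to very dense ones
image(log1p(counts), col = heat.colors(64), xlab = "x", ylab = "y")
```

The plotting cost then depends on the grid size rather than on the number of points, so the same code should scale to far larger inputs as long as `x` and `y` fit in memory.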

On the other hand, nothing technically prevents you from plotting all 17 million points; you just need to be patient (or use a faster (non-Windows) computer):

# 17 million random (x, y) pairs as a two-column matrix
mat <- matrix(rnorm(2 * 1.7e7), ncol = 2)
plot(mat)


It takes about 5 minutes to plot this, but otherwise it is not a problem at all.
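If waiting is not an option, the sampling idea from the question is easy to try. The following is only a sketch on simulated data; the sample size of 1e5 and the use of `pch = "."` (the smallest plotting symbol) are assumptions picked for illustration, and a random sample can hide rare outliers.

```r
set.seed(1)
n   <- 1e6                                # stand-in size for the example
mat <- matrix(rnorm(2 * n), ncol = 2)

# Draw a random subset of rows, without replacement
keep <- sample(nrow(mat), size = 1e5)
sub  <- mat[keep, ]

# "." draws each point as a tiny dot, which reduces overplotting
plot(sub, pch = ".", xlab = "x", ylab = "y")
```

Repeating the plot with a few different seeds gives a quick check that the overall shape does not depend on which rows happened to be drawn.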

Edit:

If you are plotting mass versus charge (as in an artificial 2D gel), for example, you might want to look at the smoothScatter function in R. It computes the local density of the data and assigns colors accordingly. I don't see a good reason for using a line graph though. I used the example code from smoothScatter but set the number of points to 10 million:

n <- 1e7
x1 <- matrix(rnorm(n), ncol = 2)
x2 <- matrix(rnorm(n, mean = 3, sd = 1.5), ncol = 2)
x  <- rbind(x1, x2)   # 10 million (x, y) points in total

oldpar <- par(mfrow = c(2, 2))
smoothScatter(x, nrpoints = 0)
smoothScatter(x)

## a different color scheme:
Lab.palette <- colorRampPalette(c("blue", "orange", "red"), space = "Lab")
smoothScatter(x, colramp = Lab.palette)

## somewhat similar, using identical smoothing computations,
## but considerably *less* efficient for really large data:
plot(x, col = densCols(x), pch = 20)
## as the comment says, this is considerably less efficient, so I didn't wait for it to finish!

par(oldpar)  # restore the previous plotting layout


Here is the old blob... for people who like it

4

I think I've seen that dataset before.

0

Thank you for your revisions. I have never seen this type of plot before; I used a line graph simply to get a general idea of the relationship between the two parameters. This smoothScatter graph is brilliant; it will highlight population densities for me. Once again, my thanks.

6
9.3 years ago
ff.cc.cc ★ 1.3k

I would suggest looking at packages specific to genome-wide data (p-values ~ data points), e.g. the ggbio package, and reading a couple of old posts, starting from genome-wide-plots-in-r.

If your issue is loading the data into RAM, consider solutions for big data such as HDF, ff and so on... here you can find an introduction to the matter.
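As a simpler stand-in for those packages (this is plain base R, not HDF or ff), a file can also be read in chunks while only a running summary is kept in memory. The sketch below is hypothetical throughout: the demo file, its `mw` and `value` columns, the 0 to 100 kDa range and the chunk size of 10,000 rows are all assumptions.

```r
# Write a small tab-separated demo file (stand-in for the real table)
set.seed(7)
demo <- file.path(tempdir(), "mw_demo.tsv")   # hypothetical file name
write.table(data.frame(mw = runif(25000, 0, 100), value = rnorm(25000)),
            demo, sep = "\t", row.names = FALSE, quote = FALSE)

breaks <- seq(0, 100, by = 5)                 # assumed Mw range, in kDa
counts <- numeric(length(breaks) - 1)         # running histogram counts

con <- file(demo, open = "r")
invisible(readLines(con, n = 1))              # skip the header line
repeat {
  lines <- readLines(con, n = 10000)          # next chunk of rows
  if (length(lines) == 0) break               # end of file reached
  chunk <- read.table(text = lines, sep = "\t",
                      col.names = c("mw", "value"),
                      colClasses = "numeric")
  # Add this chunk's histogram counts to the running totals
  counts <- counts + hist(chunk$mw, breaks = breaks, plot = FALSE)$counts
}
close(con)

barplot(counts, names.arg = head(breaks, -1),
        xlab = "Mw (kDa)", ylab = "sequences")
```

Only one chunk is held in memory at a time, so peak memory use depends on the chunk size rather than on the length of the file.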

Have a nice graph!

p.s. :

I substantially agree with Michael's comment about the usefulness of trying to visualize millions of points... but I also recall that it was sometimes useful for me to display exploratory snapshots of data (e.g. p-value landmarks, data distribution hypotheses, regional plots...). In my experience I have never plotted more than 1M points, but NGS data and the next GWAS chips will reach a higher order of magnitude...

Finally, please make clear what kind of data you collected and what information you want to extract from it.

0

If I may ask, from which journal or resource did you get the above graphs?

0

If you right-click on the image, you can see its source.

0

Hi, I chose the images just as examples, but if it helps: the first, a Manhattan plot, is quite easy to produce with R or Haploview (and many others...), while the second graph, useful e.g. for visualizing interaction data, comes from Circos.