Dear all, thank you for your time. I have a very large dataset (17 million x/y pairs in a table) that I am trying to represent as a graph, but given the number of data points I am unsure of the best strategy. R does not complain when I try to plot a line graph, yet it fails to show any graph. The code I am using is correct, as I have tested it with a smaller dataset, and although RAM usage climbs to around 8 GB, there is still plenty free. Why does R not complain if there is an error?
The best answer I can come up with is simply to sample the data, but is that the best, or the only, option?
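For what it is worth, two common work-arounds in R are to plot a random sample of the rows, or to draw a density image with smoothScatter() instead of individual points. A minimal sketch, assuming a data frame df with columns x and y (made-up names; substitute your own):

```r
# Toy stand-in for the real table; 17 million rows would work the same way
set.seed(1)
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Option 1: plot a random sample -- with ~100k points the overall shape
# of the cloud is usually indistinguishable from the full dataset
idx <- sample(nrow(df), 1e5)
plot(df$x[idx], df$y[idx], pch = ".", col = rgb(0, 0, 0, 0.2))

# Option 2: a smoothed 2-D density plot, which uses every point but
# draws one image instead of millions of individual symbols
smoothScatter(df$x, df$y)
```

Sampling keeps individual points; smoothScatter() trades them for a density view that stays readable at any dataset size.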
Dear all, many thanks for all your suggestions; I will try to answer as best I can. The reason for such a large dataset is that I am using the entire UniRef100 database of protein sequences, which contains:
kajendiran@serenity:~/Documents$ grep UniRef100_ -c uniref100.fasta
17598871
I am using the sequences it contains to calculate the Mw (molecular weight) in kDa (kilodaltons) and another value for each sequence, and then plot these two against each other to represent the entire database. I fully realise that with that many data points, individual points would become difficult to distinguish. I just wanted to see whether it was possible in R, and what the result would look like if I sorted one axis in ascending order of the other and drew a line graph. The way to produce a legible graph would be, as someone suggested, to bin the data along one axis, or to use other methods of which I was not aware.
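The sort-then-plot and binning ideas above can be sketched roughly as follows; the column names mw and other are placeholders for the real variables:

```r
# Toy stand-in for the real data: placeholder columns mw and other
set.seed(1)
df <- data.frame(mw    = rnorm(1e5, mean = 50, sd = 10),
                 other = runif(1e5))

# Sort one variable in ascending order of the other and draw a line graph
ord <- order(df$other)
plot(df$other[ord], df$mw[ord], type = "l",
     xlab = "other value (sorted)", ylab = "Mw (kDa)")

# Alternatively, bin ("segregate") one axis and plot a summary per bin,
# which stays legible no matter how many rows the dataset has
bins <- cut(df$other, breaks = 100)
avg  <- tapply(df$mw, bins, mean)
plot(avg, type = "l", xlab = "bin of other value", ylab = "mean Mw (kDa)")
```

Summarising per bin reduces 17 million rows to one value per bin before plotting, which is usually far more legible than a raw line through every point.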
I do not have much experience with R, so I was curious why R does not complain when I try to plot such a large amount of data, yet produces an empty graph. It appears that the sheer size of the dataset is the issue, as the code works perfectly with smaller, though still sizeable, datasets. I also wanted to learn from your experience what the best strategy is for producing a graph in this scenario.
I agree that this could have been posted on Stack Overflow; however, as someone stated, it is certainly common in bioinformatics to have to deal with obscene amounts of data in certain scenarios. This is what makes the field so exciting, and yet at times so painful for CPUs, RAM chips, our patience and our minds. The other reason I posted the question here is that I have asked a few questions this week pertaining to different facets of a larger challenge I am trying to overcome during an internship.
Anyway, I wanted to thank you all once again for your time; you have given me a lot to think about.
For anyone interested, here is the graph I produced by binning and ordering the Mw values along the y axis: