Question: Data Too Big To Be Stored In Memory: Common Options
9
Click downvote 670 • 7.4 years ago • Germany wrote:

I am programming something where I do not need any more sophisticated data storage than Python's dicts or sets. However, as my data is too big to be stored in memory, I have to use something else.

I tried using SQLite, but heard that it is slow for large datasets (> 10 GB) and that NoSQL would be better.

What options do you commonly use to work with data that is too large to fit in memory, and why? Are there any standard tools in bioinformatics?

Edit: Perhaps each answer should include a little about when it is a good choice and when it isn't?

(P.S. I know this isn't directly bioinformatics related, but I'm sure it is a problem many here have struggled with when working in bioinformatics.)

bioinformatics python • 23k views
modified 6.0 years ago by ole.tange 3.7k • written 7.4 years ago by Click downvote 670
7

I'll second Micans' comment here. If you tell us what you are trying to do, we might be able to help more. There is no general approach that fits all big-data problems.

written 7.4 years ago by Sean Davis 26k

I just wanted a more general list of techniques; I hope such a thread is OK. I have learnt a lot and now know many ways of working around problems with large datasets.

written 7.4 years ago by Click downvote 670

That is OK. But I still think you will learn more by asking a specific question. What is important is not knowing a list of software/methods, but knowing which one to use in a specific case.

written 7.4 years ago by lh3 32k
5

Your question is underpowered. A more specific description will enable more specific answers.

written 7.4 years ago by Micans 270

I don't know! Can you tell me in more detail?

written 6.0 years ago by He 0
9
Martin A Hansen 3.0k • 7.4 years ago • Denmark wrote:
  1. Make sure you are using the correct tool for the job, e.g. don't use BLAST for mapping short reads.
  2. Avoid databases - they are an absolute last resort for analytical work.
  3. Pre-sort your data and process it in chunks that fit in memory (see the sketch below).
  4. If 3. is a problem -> think harder :o)
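
A minimal Python sketch of point 3, assuming a tab-separated file that has already been sorted by chromosome (the file name and the final print are placeholders for real data and analysis):

    import itertools

    def records(path):
        """Yield (chromosome, line) pairs from a file already sorted by chromosome."""
        with open(path) as handle:
            for line in handle:
                yield line.split("\t", 1)[0], line

    # Because the file is pre-sorted, each chromosome arrives as one contiguous run,
    # so only one chromosome's records are ever held in memory at a time.
    for chrom, group in itertools.groupby(records("variants.sorted.tsv"), key=lambda rec: rec[0]):
        chunk = [line for _, line in group]   # one in-memory chunk
        print(chrom, len(chunk))              # stand-in for the real per-chunk analysis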
written 7.4 years ago by Martin A Hansen 3.0k
2

I agree. I work with large data sets, but the only time I used a database was for TreeFam, a true database. There are frequently better solutions. If the OP could provide more information, this discussion would be more to the point. Btw, on NoSQL: at least MongoDB is fast only if the data fits in memory. Its performance is much worse than SQL when your data is far too big for that. I do not know about the others, but I would be surprised if NoSQL in general had much better on-disk performance than SQL.

modified 7.4 years ago • written 7.4 years ago by lh3 32k
5
vaskin90 290 • 7.4 years ago • Milan, Italy wrote:

If you continue using databases even after the answers above, my advice might be helpful.

Database performance depends very much on the problem you are trying to solve. It is a common misconception that SQLite performs badly; it doesn't, it simply ships with reliability as the default setting. If you switch off journaling, use indexes, use transactions properly (try to do a lot of work in one transaction) and cache prepared queries, you will end up with performance like that shown here: http://www.sqlite.org/speed.html.
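
For example, a minimal sketch of those settings using Python's built-in sqlite3 module (the table layout, file name and toy data are invented for illustration):

    import sqlite3

    rows = [("chr1", 12345, "A/G"), ("chr1", 12400, "C/T"), ("chr2", 999, "G/G")]  # toy data

    conn = sqlite3.connect("snps.db")
    conn.execute("PRAGMA journal_mode = OFF")   # switch off journaling: speed over crash-safety
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("CREATE TABLE IF NOT EXISTS snp (chrom TEXT, pos INTEGER, genotype TEXT)")

    with conn:  # one transaction for the whole batch of inserts
        conn.executemany("INSERT INTO snp VALUES (?, ?, ?)", rows)

    conn.execute("CREATE INDEX IF NOT EXISTS snp_idx ON snp (chrom, pos)")
    for row in conn.execute("SELECT * FROM snp WHERE chrom = ? AND pos BETWEEN ? AND ?",
                            ("chr1", 12000, 13000)):
        print(row)
    conn.close()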

We use SQLite in our bioinformatics project and were able to process more than 80 GB of NGS data in minutes with it.

NoSQL would be better (in the general case) when you need scalability and sophisticated queries. When you just need to store your data and do simple queries, SQLite is better.

From our experience, the most difficult case for SQLite optimization is when you have many small records (for instance SNPs) and you need to iterate over the set a lot. In that case it is difficult to cache or optimize anything. But if you have bigger records (like sequences), then SQLite with its indexes is perfect.

written 7.4 years ago by vaskin90 290

Can you provide more details on the way you store NGS data in an SQLite db?

written 7.4 years ago by Pawel Szczesny 3.2k
1

Sure,

here is the project: ugene.unipro.ru. It's open source.

Our goal was to develop our Assembly Browser. We tried different techniques for storing reads in an SQLite db: a naive approach with a simple index, multiple tables, and so on. But we ended up with a tiling technique (as in Google Maps) and an R-tree index. Each tile (a rectangle) contains a number of reads, and when you want to navigate to a specific location in your NGS data you use the 4-dimensional R-tree index and load only those tiles with the reads you need.

The approach has one drawback: you need to import your BAM/SAM file into our internal format (with tiles and indexes), which takes ~40 minutes for an 80 GB file. But after that you get a full coverage graph and can instantly navigate to any part of your assembly. It is almost impossible to navigate big NGS data with other programs on the market, because they use a different approach that slows them down.
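
To illustrate the general idea of tile lookups (not UGENE's actual schema), here is a simplified two-dimensional sketch using SQLite's R*Tree module from Python; it requires an SQLite build with RTREE enabled, and all table and column names are invented for the example:

    import sqlite3

    conn = sqlite3.connect("assembly_index.db")
    # one row per tile, with a bounding box in (reference position, packing row) space
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS tiles "
                 "USING rtree(id, min_pos, max_pos, min_row, max_row)")
    conn.execute("CREATE TABLE IF NOT EXISTS tile_reads (id INTEGER PRIMARY KEY, data BLOB)")

    with conn:
        conn.execute("INSERT INTO tiles VALUES (1, 0, 9999, 0, 63)")
        conn.execute("INSERT INTO tiles VALUES (2, 10000, 19999, 0, 63)")

    # navigating to a viewport loads only the overlapping tiles
    view = (15000, 16000, 0, 31)
    hits = conn.execute("SELECT id FROM tiles WHERE max_pos >= ? AND min_pos <= ? "
                        "AND max_row >= ? AND min_row <= ?", view).fetchall()
    print(hits)  # [(2,)]
    conn.close()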

written 7.4 years ago by vaskin90 290

From the manual, it seems that you do not need BAM once it is imported. If so, I am not sure the purpose of importing an entire BAM. You can collect summary information, mainly read depth, and keep them in a database or in a binary format like IGV. This is going to be a small db/file. Detailed alignment should still be stored in BAM. I guess random access with sqlite is slower than with BAM. A sqlite db is probably much larger than BAM, too. Asking users to replicate data in a larger format only used by ugene is a significant disadvantage.

written 7.4 years ago by lh3 32k
5
ff.cc.cc 1.3k • 7.4 years ago • European Union wrote:

I struggled with the same issue a few years ago (genotype datasets of ~2 GB).

I agree with the sensible answers and tips above (avoid relational DBs, split data, refactor, and so on), but I know that sometimes you can't re-engineer the problem to work with sequential streams, or you need random access.

My best option was to use the HDF5 storage engine. As they state on the site: "HDF technologies address the problems of how to organize, store, discover, access, analyze, share, and preserve data in the face of enormous growth in size and complexity".

I had to build the libs from source under Windows, but on Linux precompiled packages are available in every distro. Then I customized a data format (a bunch of structured data tables) for storing SNPs and gene expression, working in C/C++. Accessing the data is possible i) visually, through 3rd-party tools like HDFView or the Intel Array Visualizer, or ii) programmatically, through API calls.

Performance is incredible: epistasis tests (like plink --fast-epi) run as if on in-memory bed files, and genome-wide eQTL tests on 60 CEU samples run in less than 1 hour.

The core of the code is something like this...

class h5TPED_I {

  protected:
    typedef struct {...} T_SNP;
    typedef struct {...} T_Sample;
    typedef struct {...} T_Gene;
    typedef struct {...} T_RefSeq;

   // file metadata
    ...
   // create a new file and its data structure
   virtual bool buildStruct()=0;

   // dependent build methods
   virtual int doDataTable()=0;
   virtual int doSampleTable()=0;
   virtual int doSNPTable()=0;
   virtual int doExpressionTable()=0;

   // setters
   virtual void setData(const std::string &table, const int row, const int col, const T_ExpType val)=0;
   virtual void setData(const std::string &table, const int row, const int col, const T_GType val)=0;
   virtual void setData(const std::string &table, const int row, const T_Sample &val)    =0;
   virtual void setData(const std::string &table, const int row, const T_SNP &val)        =0;
   virtual void setData(const std::string &table, const int row, const T_RefSeq &val)    =0;       
   //virtual void setData(const std::string &table, const int row, const long &val)=0;

   // getters
   virtual void getData(const std::string &table, const int row, const int col, T_ExpType &val)const =0;
   virtual void getData(const std::string &table, const int row, const int col, T_GType &val)    const =0;
   virtual void getData(const std::string &table, const int row, T_Sample &val)const =0;
   virtual void getData(const std::string &table, const int row, T_SNP &val)    const =0;
   virtual void getData(const std::string &table, const int row, T_RefSeq &val)const =0;
   //virtual void getData(const std::string &table, const int row, long &val) const =0;

    ...
   // function to build indexes
   virtual bool buildIndex() = 0;

   public:

   // Empty constructor
   h5TPED_I();

   // Constructor from existing file
   h5TPED_I(const std::string &szFilename);

   // val points to memory buffer in which SNP is loaded
   virtual void getSnpPtr(const int row, T_GType *&val, const std::string &table = "/SNPDataTableInv") const = 0;
   virtual void getSnpSubsetMem(const int snpInd, T_GType *val, const size_t mask_sz, const hsize_t *mask, const std::string &table) const {};
   //                        
   virtual void getSamplePtr(const int sampInd, T_GType *&val, const std::string &table = "/SNPDataTable") const = 0;

   //
   virtual void getSampleMem(const int sampInd, T_GType *val, const std::string &table = "/SNPDataTable") const = 0;

   //
   virtual void getGxpPtr(const int row, T_ExpType *&val, const std::string &table = "/ExpDataTable") const =0;

   //
   // General Info ------------------------------------------------------------------------------------------------------------------------
   std::string filename() const { return m_filename; };

   inline unsigned numSamples() const { return m_nSamples; };
   inline unsigned numSnps() const { return m_nSnp; };       
   inline unsigned numChrs() const { return m_nChr; };
   inline unsigned numGenes()const { return m_nGenes; };

   // default value for NA data
   inline T_GType NA() const { return -1; }

    ...

It's hosted on Bitbucket and is still private, since I would like to do some code cleaning, but it's working fine.

If someone is interested and would like to work on refinements, plugins, extensions or benchmark development, please let me know.

modified 7.4 years ago by Giovanni M Dall'Olio 26k • written 7.4 years ago by ff.cc.cc 1.3k
1

Please edit this to make it readable. Prefix lines of code with 4 spaces.

written 7.4 years ago by Neilfws 48k

My problem with HDF5 is that it is very difficult to query when the data has to be accessed without an index (e.g. find-by-name). The code is highly verbose too.

written 7.4 years ago by Pierre Lindenbaum 127k

Yes, in my case I build a memory map like <rsid, hdf_index>. A 1 MB map can index many thousands of SNPs. Sorry for the verbosity. I also encountered a few issues pasting readable code into the answer.

written 7.4 years ago by ff.cc.cc 1.3k

I meant verbose in general (in C, you have to open/close every step of the HDF5 workflow as far as I remember)

written 7.4 years ago by Pierre Lindenbaum 127k
4
wdiwdi 380 • 7.4 years ago • Germany wrote:

If you only need key/value storage, as you wrote, and that is your bottleneck, I recommend that you look at the Tokyo Cabinet/Kyoto Cabinet storage engines. These are probably the fastest and most powerful options for this type of storage need.
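
Both come with their own bindings; as a rough illustration of the disk-backed key/value pattern they implement, here is a sketch using Python's built-in dbm module (the Cabinet APIs are similar in spirit: dict-like get/set on an on-disk file):

    import dbm

    with dbm.open("kvstore", "c") as db:   # 'c' creates the file if it does not exist
        db["read_0001"] = "ACGTACGT"       # str keys/values are encoded to bytes on disk
        db["read_0002"] = "TTGACCAT"
        print(db["read_0001"])             # b'ACGTACGT'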

written 7.4 years ago by wdiwdi 380
2
Pierre Lindenbaum 127k • 7.4 years ago • France/Nantes/Institut du Thorax - INSERM UMR1087 wrote:

I use BerkeleyDB (either the C or Java version). There is a binding for Python, as far as I remember.

http://www.oracle.com/technetwork/products/berkeleydb/downloads/index.html

My only problem with it is that it can be very difficult to configure when you want to tune your application (cache, transactions, ...).

The C package also includes an SQLite-compatible SQL API that uses the library.
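
A minimal sketch, assuming the bsddb3 binding mentioned in the comment below (module and function names as I recall them; check the package documentation):

    import bsddb3   # Python binding for Berkeley DB (pip install bsddb3)

    db = bsddb3.btopen("positions.db", "c")   # on-disk B-tree, created if missing
    db[b"chr1:12345"] = b"A/G"                # keys and values are bytes
    db[b"chr1:12400"] = b"C/T"
    print(db[b"chr1:12345"])
    db.close()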

written 7.4 years ago by Pierre Lindenbaum 127k

That was what my advisor recommended; however, it has been deprecated since 2.7 (http://docs.python.org/2/library/bsddb.html). Still, as I doubt 2.7 will go out of use soon, perhaps I should try it.

PS: Bindings exist for Python 3, see http://pypi.python.org/pypi/bsddb3/

modified 7.4 years ago • written 7.4 years ago by Click downvote 670
2
William 4.6k • 7.4 years ago • Europe wrote:

If possible, always stream through big datasets. Often it is not necessary to keep all the data in memory at the same time.

Look, for instance, at the difference between DescriptiveStatistics and SummaryStatistics in Apache Commons Math. Both compute statistics, but only SummaryStatistics will work on a big data set, because it keeps only one record of the dataset in memory at any point in time. DescriptiveStatistics crashes with an out-of-memory error very quickly on big data sets.

http://commons.apache.org/math/userguide/stat.html#a1.2_Descriptive_statistics

"DescriptiveStatistics maintains the input data in memory and has the capability of producing 'rolling' statistics computed from a 'window' consisting of the most recently added values.

SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values."

The best thing to do (if possible) is to refactor your code so that you only have one record, or a limited set of records, in memory at any point in time.
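
The same single-pass idea in Python: a minimal sketch of Welford's streaming mean/variance (the input file name is a placeholder for a file with one number per line):

    class StreamingStats:
        """One-pass mean/variance (Welford's algorithm): O(1) state, never stores the data."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def add(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    stats = StreamingStats()
    with open("coverage.txt") as handle:   # streamed line by line, never fully loaded
        for line in handle:
            stats.add(float(line))
    print(stats.n, stats.mean, stats.variance())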

written 7.4 years ago by William 4.6k
2
KCC 4.0k • 7.4 years ago • Cambridge, MA wrote:

To what extent can you recode your data? For instance, you might be storing data as a number when it has fewer than 256 unique values and could be stored as a single character and decoded as needed. You might be storing some values as floating point when they only have one decimal place and could be multiplied by 10 and stored as integers. I am a little out of my depth here because you are using Python; in C or C++, one has a lot of control over the relative size of the things one is storing.
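
Even from Python you can get some of that control with numpy dtypes; a small sketch of the two recodings mentioned above (the example values are made up):

    import numpy as np

    # one decimal place -> multiply by 10 and keep 16-bit integers (2 bytes instead of 8)
    values = np.array([12.3, 45.6, 7.8])
    packed = np.round(values * 10).astype(np.int16)
    restored = packed / 10.0                      # decode when needed

    # fewer than 256 distinct labels -> one byte per entry plus a small code table
    genotypes = np.array(["AA", "AG", "GG", "AG"])
    codes, inverse = np.unique(genotypes, return_inverse=True)
    small = inverse.astype(np.uint8)              # 1 byte per genotype
    decoded = codes[small]                        # decode on demand
    print(restored, decoded)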

Also, don't neglect the option of simply buying more RAM, weighed against how much time it might take you to recode all of this. I recently bought more RAM and my big-data programming tasks have become vastly simpler.

written 7.4 years ago by KCC 4.0k
2
Manu Prestat 4.0k • 7.4 years ago • Lyon, France wrote:

If your purpose is to deal with FASTQ or FASTA in Python, I would recommend the screed module. It parses and indexes your sequence files into a DB file. You can then read it and use it as if it were a dict() loaded into memory.
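
A minimal sketch of that workflow (function and key names are from the screed documentation as I remember it and may differ between versions):

    import screed   # pip install screed

    screed.make_db("reads.fasta")          # one-off: parse and index the file into an on-disk DB

    db = screed.ScreedDB("reads.fasta")    # dict-like, backed by the on-disk index
    record = db["read_0001"]               # random access by sequence name
    print(record["sequence"])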

modified 7.4 years ago • written 7.4 years ago by Manu Prestat 4.0k
2
Giovanni M Dall'Olio 26k • 7.4 years ago • London, UK wrote:

Since you use Python, you can have a look at the libraries for HDF5.

HDF5 is a binary format used in physics and other fields where there is a need to store large datasets.

There are a couple of Python libraries which allow you to use it more or less like an array. One is called PyTables, and the other h5py (HDF5 for Python). PyTables is a bit more advanced, but h5py works well too. Have a look at their documentation; both libraries are well documented.
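
For example, with h5py you can write and read a large matrix in slices so that only one block is ever in memory (file and dataset names here are just for illustration):

    import h5py
    import numpy as np

    # write a 100000 x 60 matrix block by block
    with h5py.File("expression.h5", "w") as f:
        dset = f.create_dataset("expr", shape=(100000, 60), dtype="f4", chunks=True)
        for start in range(0, 100000, 10000):
            dset[start:start + 10000, :] = np.random.rand(10000, 60)

    # later: load only the slice you need
    with h5py.File("expression.h5", "r") as f:
        block = f["expr"][50000:51000, :]
        print(block.shape)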

written 7.4 years ago by Giovanni M Dall'Olio 26k
2
ole.tange 3.7k • 6.0 years ago • Denmark wrote:

Depending on your data, the data structures may compress very well in RAM. If that is the case, zram (https://en.wikipedia.org/wiki/Zram) can give a huge boost: instead of swapping to disk, it swaps to a compressed RAM disk. Compressing to RAM is much faster than disk I/O.

written 6.0 years ago by ole.tange 3.7k
1
Micans 270 • 7.4 years ago wrote:

There are good answers already; if you can, always stream or chunk the data, and avoid databases. You've given us very little information about the actual problem at hand, though. Quite often the particular problem leads to particular solutions. In network analysis one could work with pruned networks, where edges are removed based on an absolute weight threshold or on weight ranking. Different software packages can have very different levels of overhead. If your problem has to do with read mapping, there will be a host of other considerations. In general, look for ways to reduce problem size AND look for bigger hardware. Run your software on large (but not huge) samples to get an idea of where the bottleneck lies.

Edit: If the goal is deduplication, then a simple approach is to do multiple passes, with each pass only collecting data that has a particular trait; for reads this could be the first base or the first two bases. A read-specific approach that I've taken is to compress the reads in memory. It should be possible to achieve about 3.5-fold compression using a simplistic 2-bit encoding of the 4 bases; I've used a length-encoding approach that can handle Ns as well. This just shows that it is possible to do better by exploiting particular characteristics of the problem at hand, but we do not know your problem.
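
A minimal sketch of the 2-bit idea in Python (it skips the N handling, which the length-encoding scheme above addresses):

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(read):
        """Pack an N-free read into 2 bits per base (4 bases per byte)."""
        out = bytearray()
        for i in range(0, len(read), 4):
            chunk = read[i:i + 4]
            byte = 0
            for base in chunk:
                byte = (byte << 2) | CODE[base]
            byte <<= 2 * (4 - len(chunk))   # left-align a short final chunk
            out.append(byte)
        return bytes(out), len(read)

    def unpack(packed, length):
        bases = []
        for byte in packed:
            for shift in (6, 4, 2, 0):
                bases.append(BASES[(byte >> shift) & 3])
        return "".join(bases[:length])

    packed, n = pack("ACGTACGTAC")
    assert unpack(packed, n) == "ACGTACGTAC"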

modified 7.4 years ago • written 7.4 years ago by Micans 270
1
kajendiran56 120 • 7.4 years ago wrote:

Some good answers here already. I once wrote a large data file but arranged the data so that simply using grep within my script, run simultaneously on multiple threads, achieved what I wanted. There are better solutions above; I just thought I would add this, as it has helped me before when I wanted to avoid spending too much time.

written 7.4 years ago by kajendiran56 120
1
Amos 40 • 6.0 years ago • European Union wrote:

SciDB might be an alternative; it is designed for some fairly cool in-database computation. It has APIs for R and Python.

written 6.0 years ago by Amos 40