Question

Best compression for single cell RNA-seq object

3.1 years ago

MutationalMeltdown ▴ 180

I generated a scRNA-seq object (counts, PCA, UMAP embeddings, DEGs etc.) in Scanpy or Seurat. What is the best data structure to store this in to reduce the size of the object?

I'm considering H5AD (scanpy/anndata), RDS or H5Seurat (Seurat), or Loom.

Fast loading/access would also be good, of course. Thanks.

seurat rna-seq scanpy single-cell scRNA-seq • 3.1k views

ADD COMMENT • link updated 2.7 years ago by chansigit • 0 • written 3.1 years ago by MutationalMeltdown ▴ 180

For archival purposes, or for access in other tools? gzip is fast to decompress on the fly as well.

ADD REPLY • link 3.1 years ago by Istvan Albert 100k

For access: imagine a database of objects. H5AD supports gzip compression and I use it. The main issue is the counts matrix itself (even a sparse matrix tends to be far larger than the annotations). All the formats I listed (apart from RDS) are built on HDF5, which seems pretty well optimised, but I'm interested in the differences between them and in any alternatives.

ADD REPLY • link 3.1 years ago by MutationalMeltdown ▴ 180

I don't fully understand the problem and its requirements, but my gut feeling is that I would steer far, far away from RDS. Instead, I would save the sparse matrix with scipy/numpy (scipy.sparse.save_npz) and model the rest of the information as a relational database in SQLite.

I feel that would give the highest level of flexibility and extensibility for the future.
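A rough sketch of that layout, assuming `numpy` and `scipy` are available (`sqlite3` ships with Python). The `cells` table, its columns, the cluster labels and the file names are all hypothetical and only there to show the idea: the matrix lives in an `.npz` file and the annotations are queried with SQL to pick matrix rows.

```python
import os
import sqlite3
import tempfile

import numpy as np
from scipy import sparse

# Toy counts matrix (cells x genes); sizes are made up for illustration.
n_cells, n_genes = 1000, 200
rng = np.random.default_rng(0)
counts = sparse.csr_matrix(
    rng.poisson(0.05, size=(n_cells, n_genes)).astype(np.int32)
)

workdir = tempfile.mkdtemp()
matrix_path = os.path.join(workdir, "counts.npz")
db_path = os.path.join(workdir, "annotations.sqlite")

# 1) The sparse matrix goes into a compressed .npz file.
sparse.save_npz(matrix_path, counts, compressed=True)

# 2) Per-cell annotations go into SQLite (hypothetical schema).
#    row_index ties each annotation row to a row of the matrix.
con = sqlite3.connect(db_path)
con.execute(
    "CREATE TABLE cells ("
    "row_index INTEGER PRIMARY KEY, "
    "barcode TEXT NOT NULL, "
    "cluster TEXT NOT NULL)"
)
con.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(i, f"cell{i}", f"c{i % 4}") for i in range(n_cells)],
)
con.commit()

# 3) Query the annotations, then use the result to slice the matrix.
wanted_rows = [
    row[0]
    for row in con.execute(
        "SELECT row_index FROM cells WHERE cluster = ? ORDER BY row_index",
        ("c2",),
    )
]
con.close()

loaded = sparse.load_npz(matrix_path)
subset = loaded[wanted_rows, :]
print(subset.shape)
```

Note that `load_npz` reads the whole matrix into memory before slicing, so this sketch does not give partial reads from disk.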

ADD REPLY • link 3.1 years ago by Istvan Albert 100k

Right, but that's similar to what anndata (the object written to H5AD) already is. Isn't that reinventing the wheel, or am I missing something? https://anndata.readthedocs.io/en/latest/

ADD REPLY • link 3.1 years ago by MutationalMeltdown ▴ 180

If it is H5AD then it can't be a relational database, right?

I honestly think that storing biological data in HDF5 format is a mistake. Relational databases are more elegant, more robust, and simple to use from any language, or from no language at all.

ADD REPLY • link 3.1 years ago by Istvan Albert 100k

The problem with relational databases is that the scale and dimensionality of this data are beyond their limits. MySQL can hold only hundreds of numerical columns, and for Oracle the upper limit is 1k columns. We are building a specialized database, called unified giant table, to hold large-scale omics data.

ADD REPLY • link 2.7 years ago by chansigit • 0