Question

Best practices for data from CellXgene

17 months ago

William ▴ 10

Hi,

I downloaded a dataset from CellXGene in hdf5 format and have been trying to use it for further analyses. Are there any standard steps one should take before re-using data like this from a database? For example, should I extract just the original expression matrix from the object (adata.X) and make a new hdf5 file, or can I leave all the metadata in place and use the original hdf5? My concern is that some of the extra content in the anndata object may interfere with my analysis. Is the data in adata.X the original data, or has it already been manipulated?

CellXGene sc-seq • 851 views

updated 16 months ago by mczerwinski ▴ 40 • written 17 months ago by William ▴ 10

score 4 · Answer 1 · 2022-11-30

Are there any standard steps one should take before re-using data like this from a database?

All RNA datasets include raw read/UMI counts; some also include the “original” normalized signal as provided by the contributors. The only thing you need to check is whether a dataset has raw counts only, or both raw counts and normalized signal.

This is easy:

- If adata.raw.X exists, the dataset has both raw and normalized data: adata.X is normalized and adata.raw.X is the raw counts/UMI.
- If adata.raw.X doesn’t exist, the dataset has only raw data: adata.X is the raw counts/UMI.

The raw counts/UMI are unmodified, as they come out of the processing of raw reads.

Nothing else in the adata will “interfere” with analysis; if anything, it can enrich it by providing standard gene and cell metadata. It should just reside in the object without any effect unless you specifically interact with it.
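The check described above can be sketched in Python roughly as follows. This is a minimal illustration, not an official CELLxGENE or anndata recipe: `split_counts` is a hypothetical helper name, the file name in the comment is a placeholder, and the stand-in objects exist only so the snippet runs without a downloaded file. It relies on the fact that an AnnData object's `.raw` attribute is `None` when no raw layer was stored.

```python
from types import SimpleNamespace


def split_counts(adata):
    """Return (raw_counts, normalized) following the convention described above.

    Works on any object exposing `.X` and `.raw` the way AnnData does.
    `split_counts` is a hypothetical helper name, not part of anndata.
    """
    if getattr(adata, "raw", None) is not None:
        # Both present: .raw.X holds raw counts, .X holds the normalized signal.
        return adata.raw.X, adata.X
    # Only one matrix present: .X is the raw counts; there is no normalized layer.
    return adata.X, None


# With a real download this would look like (file name is a placeholder):
#   import anndata as ad
#   adata = ad.read_h5ad("dataset.h5ad")
#   raw_counts, normalized = split_counts(adata)

# Stand-in objects so the sketch runs without a downloaded file.
both = SimpleNamespace(X="normalized", raw=SimpleNamespace(X="raw"))
raw_only = SimpleNamespace(X="raw", raw=None)

print(split_counts(both))
print(split_counts(raw_only))
```

Either way, the first returned value is what you would start from if you want to redo normalization yourself; the rest of the object can stay as it is.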

More details here: https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md

New datasets are being added all the time: CZ CELLXGENE Datasets