Question

Extract datasets from .h5 file


4.1 years ago

JulianC ▴ 30

Hi!

I retrieved single-cell data from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3489183). The file is in .h5 format, produced by the Cell Ranger v2.0 pipeline (10x Genomics). To open it and look at the datasets inside, I used the following Python code:

import h5py
import pandas as pd
import numpy as np

f = h5py.File('GSM3489183_IPF_01_filtered_gene_bc_matrices_h5.h5', 'r')
list(f.keys())

['GRCh38']

dset = f['GRCh38']
list(dset)

['barcodes', 'data', 'gene_names', 'genes', 'indices', 'indptr', 'shape']

According to the Cell Ranger manual, the 'data' dataset should contain the nonzero UMI counts in column-major order, and the 'shape' dataset is a tuple of (# rows, # columns) giving the matrix dimensions. Each of these datasets has a single column. To look at the values, I used:

a = np.array(f['GRCh38/data'])
pd.DataFrame(a)

However, I don't see how to get from this to a table with genes as rows and cells as columns. The 'data' dataset must hold the expression values for each gene in each cell, but since it has only one column, I don't see how to assign each value to the right gene and cell. Do you have experience with this type of file? Thank you in advance!

Single-cell Cell ranger Python • 5.6k views

score 1 · Answer 1 · 2020-04-02

4.1 years ago

GenoMax 141k

Take a look at this 10x Genomics page that describes how to work with h5 data.
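For what it's worth, here is a minimal sketch of the idea, assuming the file follows the layout listed in the question: one group per genome ('GRCh38') holding 'data', 'indices', 'indptr' and 'shape', which together describe a sparse matrix in compressed sparse column (CSC) form. 'indices' gives the row (gene) of each value in 'data', and 'indptr' marks where each column (cell) starts. The function names read_10x_h5_v2 and to_dataframe are made up for illustration, and I have not run this on the GEO file itself.

```python
import h5py
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix


def read_10x_h5_v2(path, genome="GRCh38"):
    """Read a Cell Ranger v2-style .h5 file (layout assumed from the question).

    Returns (sparse genes-x-cells matrix, gene names, barcodes).
    """
    with h5py.File(path, "r") as f:
        group = f[genome]
        data = group["data"][:]        # nonzero UMI counts
        indices = group["indices"][:]  # row (gene) index of each count
        indptr = group["indptr"][:]    # start offset of each column (cell)
        shape = tuple(int(x) for x in group["shape"][:])
        # String datasets come back as bytes, so decode them.
        gene_names = [g.decode() for g in group["gene_names"][:]]
        barcodes = [b.decode() for b in group["barcodes"][:]]
    matrix = csc_matrix((data, indices, indptr), shape=shape)
    return matrix, gene_names, barcodes


def to_dataframe(matrix, gene_names, barcodes):
    """Dense table with genes as rows and cells as columns.

    Converting to dense can use a lot of memory for large matrices.
    """
    return pd.DataFrame(matrix.toarray(), index=gene_names, columns=barcodes)
```

With the file from the question this would be called as matrix, genes, barcodes = read_10x_h5_v2('GSM3489183_IPF_01_filtered_gene_bc_matrices_h5.h5'), followed by to_dataframe(matrix, genes, barcodes). Keeping the matrix sparse for as long as possible is usually the safer choice, since most entries in single-cell count data are zero.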