Scanpy Pearson residual PCA error
2
0
Entering edit mode
20 months ago
Emily ▴ 70

I got a

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

when trying to run this part of the code

sc.pp.pca(adata, n_comps=50)
n_cells = len(adata)
sc.tl.tsne(adata, use_rep="X_pca")

Not sure if the cause of the error is that I merged 4 10x sample files into one AnnData (I just used the .concatenate() command), or if it's something else.

Everything was working well until that step

python scRNA scanpy PCA • 2.3k views
ADD COMMENT
0
Entering edit mode

Seems like merging samples introduced some missing values.
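
One way to confirm that (a minimal diagnostic sketch, assuming adata.X is a dense array or a scipy sparse matrix) is to count NaN/inf entries in the expression matrix directly:

import numpy as np
import scipy.sparse as sp

def count_bad_entries(adata):
    # Count NaN and inf entries in adata.X (handles sparse or dense X)
    vals = adata.X.data if sp.issparse(adata.X) else np.asarray(adata.X)
    return int(np.isnan(vals).sum()), int(np.isinf(vals).sum())

print(count_bad_entries(adata))  # (0, 0) means no problematic entries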

ADD REPLY
0
Entering edit mode

Are the variables the same in each of the objects? If not, you may need a different merging strategy.

ADD REPLY
0
Entering edit mode

The variables are the same, but n_obs differs slightly across the 4 files (runs): 13818, 13829, 13830, and 13843.

I followed this strategy https://scanpy.discourse.group/t/merge-multiple-10x-samples/184

import scanpy as sc

adata_pbmc_normal_L001 = sc.read_10x_mtx('~/pbmc_normal_L001/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L002 = sc.read_10x_mtx('~/pbmc_normal_L002/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L003 = sc.read_10x_mtx('~/pbmc_normal_L003/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L004 = sc.read_10x_mtx('~/pbmc_normal_L004/outs/filtered_feature_bc_matrix/')

# shorthand names used below
adata1, adata2, adata3, adata4 = (adata_pbmc_normal_L001, adata_pbmc_normal_L002,
                                  adata_pbmc_normal_L003, adata_pbmc_normal_L004)
adata = adata1.concatenate(adata2, adata3, adata4, index_unique=None)

I didn't do any filtration prior to concatenating the dataset.
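
For completeness, a quick sanity check (just a sketch using the objects above) that all four runs share the same var_names — if they do, the concatenation should not by itself introduce missing values:

# Sanity check (sketch): all four Cell Ranger runs should share the same gene list
ref = adata_pbmc_normal_L001.var_names
for other in (adata_pbmc_normal_L002, adata_pbmc_normal_L003, adata_pbmc_normal_L004):
    print(ref.equals(other.var_names))  # expect True for all three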

ADD REPLY
1
Entering edit mode
20 months ago
zorbax ▴ 610

You can use:

import numpy as np
import anndata as ad

adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=np.nan)

With NumPy's np.nan, all the missing values are filled with a float NaN instead of the pandas missing-value type.
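
As an alternative (an assumption about the goal, not part of the original suggestion): if you want to avoid NaN values reaching PCA entirely, filling with 0 keeps the matrix sparse:

import anndata as ad

# Sketch: fill missing entries with 0 instead of NaN, which keeps adata.X sparse
adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=0)
adata.obs_names_make_unique()  # barcodes from different 10x runs can collide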

ADD COMMENT
0
Entering edit mode

Not sure if ad.concat() is a typo... did you mean pd.concat(...)?

ADD REPLY
0
Entering edit mode

I still get ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). when running this part of the code (below) after your suggested change:

sc.pp.pca(adata, n_comps=50)
n_cells = len(adata)
sc.tl.tsne(adata, use_rep="X_pca")
ADD REPLY
0
Entering edit mode

Without any clue about your dataset, you could fill the missing values with the column means, e.g. adata.obs = adata.obs.fillna(adata.obs.mean(numeric_only=True)). The PCA implementation from scikit-learn doesn't allow NaN values.
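
If the NaN values are in the expression matrix itself rather than in adata.obs, a similar fill there (only a sketch, since the dataset isn't shown; it assumes a float sparse or dense adata.X) would be:

import numpy as np
import scipy.sparse as sp

# Sketch: replace NaN (and inf) entries in the expression matrix before PCA
if sp.issparse(adata.X):
    adata.X.data = np.nan_to_num(adata.X.data)
else:
    adata.X = np.nan_to_num(adata.X)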

ADD REPLY
0
Entering edit mode

The data is from here: I converted the SRA files into FASTQ and then into count matrices using 10x Cell Ranger, producing 4 files. Each file is an AnnData object with n_obs × n_vars of 13818 × 36601, 13829 × 36601, 13830 × 36601, and 13843 × 36601, with var: 'gene_ids', 'feature_types'.

When I merge all 4 datasets using adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=np.nan) followed by adata.obs = adata.obs.fillna(adata.obs.mean(numeric_only=True)),

I get a warning message UserWarning: Observation names are not unique. To make them unique, call '.obs_names_make_unique'., which shouldn't be a big problem (I think). Then I ran adata.var_names_make_unique(), sc.pp.filter_genes(adata, min_cells=1) to remove genes detected in no cells, and sc.external.pp.scrublet(adata) to remove doublets.

The rest of the code is the same as Scanpy's Pearson residuals tutorial, but it keeps getting stuck at the PCA step with ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

ADD REPLY
0
Entering edit mode
4 months ago
Maëlick • 0

I had the same problem and then realized that I had forgotten the preprocessing! If you add these commands before running the PCA, it should work:

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Also, you don't need to add fill_value=np.nan; it will only create a non-sparse matrix that slows down the computation a lot.
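
Putting the suggestions in this thread together, a minimal end-to-end sketch (an assumption about the overall ordering, using the four objects from above; the exact tutorial steps may differ):

import scanpy as sc
import anndata as ad

adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=0)  # avoid the np.nan fill
adata.obs_names_make_unique()            # 10x barcodes can collide across runs
sc.pp.filter_genes(adata, min_cells=1)   # drop genes detected in no cell
sc.pp.normalize_total(adata)             # preprocessing, as suggested above
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
sc.tl.tsne(adata, use_rep="X_pca")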

ADD COMMENT
