Scanpy Pearson residual PCA error
2
0
Entering edit mode
20 months ago
Emily ▴ 70

I got a

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

when trying to run this part of the code

sc.pp.pca(adata, n_comps=50)
n_cells = len(adata)
sc.tl.tsne(adata, use_rep="X_pca")

Not sure if the cause of the error is that I merged 4 10x sample files into one AnnData (I just used the .concatenate() command), or if it's something else.

Everything was working well until that step

python scRNA scanpy PCA • 2.3k views
ADD COMMENT
0
Entering edit mode

Seems like merging samples introduced some missing values.
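
One way to confirm that (a minimal diagnostic sketch, assuming adata.X is a dense array or a scipy sparse matrix) is to count NaN/inf entries in the expression matrix directly:

import numpy as np
import scipy.sparse as sp

def count_bad_entries(adata):
    # Count NaN and inf entries in adata.X (handles sparse or dense X)
    vals = adata.X.data if sp.issparse(adata.X) else np.asarray(adata.X)
    return int(np.isnan(vals).sum()), int(np.isinf(vals).sum())

print(count_bad_entries(adata))  # (0, 0) means no problematic entries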

ADD REPLY
0
Entering edit mode

Are the variables the same in each of the objects? If not, you may need a different merging strategy.

ADD REPLY
0
Entering edit mode

The variables are the same, but n_obs differs slightly across the 4 files (runs): 13818, 13829, 13830, and 13843.

I followed this strategy https://scanpy.discourse.group/t/merge-multiple-10x-samples/184

import scanpy as sc

adata_pbmc_normal_L001 = sc.read_10x_mtx('~/pbmc_normal_L001/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L002 = sc.read_10x_mtx('~/pbmc_normal_L002/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L003 = sc.read_10x_mtx('~/pbmc_normal_L003/outs/filtered_feature_bc_matrix/')
adata_pbmc_normal_L004 = sc.read_10x_mtx('~/pbmc_normal_L004/outs/filtered_feature_bc_matrix/')

# shorthand names used below
adata1, adata2, adata3, adata4 = (adata_pbmc_normal_L001, adata_pbmc_normal_L002,
                                  adata_pbmc_normal_L003, adata_pbmc_normal_L004)
adata = adata1.concatenate(adata2, adata3, adata4, index_unique=None)

I didn't do any filtration prior to concatenating the dataset.
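
For completeness, a quick sanity check (just a sketch using the objects above) that all four runs share the same var_names — if they do, the concatenation should not by itself introduce missing values:

# Sanity check (sketch): all four Cell Ranger runs should share the same gene list
ref = adata_pbmc_normal_L001.var_names
for other in (adata_pbmc_normal_L002, adata_pbmc_normal_L003, adata_pbmc_normal_L004):
    print(ref.equals(other.var_names))  # expect True for all three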

ADD REPLY
1
Entering edit mode
20 months ago
zorbax ▴ 610

You can use:

import numpy as np
import anndata as ad

adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=np.nan)

With NumPy's np.nan, all the missing values are filled with a float NaN instead of the pandas missing-value type.
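
As an alternative (an assumption about the goal, not part of the original suggestion): if you want to avoid NaN values reaching PCA entirely, filling with 0 keeps the matrix sparse:

import anndata as ad

# Sketch: fill missing entries with 0 instead of NaN, which keeps adata.X sparse
adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=0)
adata.obs_names_make_unique()  # barcodes from different 10x runs can collide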

ADD COMMENT
0
Entering edit mode

Not sure if ad.concat() is a typo... did you mean pd.concat(...)?

ADD REPLY
0
Entering edit mode

I still get ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). when running this part of the code (below) after your suggested change:

sc.pp.pca(adata, n_comps=50)
n_cells = len(adata)
sc.tl.tsne(adata, use_rep="X_pca")
ADD REPLY
0
Entering edit mode

Without any clue about your dataset, you could fill the missing values with the column means, e.g. adata.obs = adata.obs.fillna(adata.obs.mean(numeric_only=True)). The PCA implementation from scikit-learn doesn't allow NaN values.
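
If the NaN values are in the expression matrix itself rather than in adata.obs, a similar fill there (only a sketch, since the dataset isn't shown; it assumes a float sparse or dense adata.X) would be:

import numpy as np
import scipy.sparse as sp

# Sketch: replace NaN (and inf) entries in the expression matrix before PCA
if sp.issparse(adata.X):
    adata.X.data = np.nan_to_num(adata.X.data)
else:
    adata.X = np.nan_to_num(adata.X)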

ADD REPLY
0
Entering edit mode

The data is from here: I converted the SRA files into FASTQ and then into count matrices using 10x Cell Ranger, producing 4 files. Each file is an AnnData object with n_obs × n_vars of 13818 × 36601, 13829 × 36601, 13830 × 36601, and 13843 × 36601, with var: 'gene_ids', 'feature_types'.

When I merge all 4 datasets using adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=np.nan) followed by adata.obs = adata.obs.fillna(adata.obs.mean(numeric_only=True)),

I get a warning message UserWarning: Observation names are not unique. To make them unique, call '.obs_names_make_unique'., which shouldn't be a big problem (I think). Then I ran adata.var_names_make_unique(), sc.pp.filter_genes(adata, min_cells=1) to remove genes detected in no cells, and sc.external.pp.scrublet(adata) to remove doublets.

The rest of the code is the same as Scanpy's Pearson residuals tutorial, but it keeps getting stuck at the PCA step with ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

ADD REPLY
0
Entering edit mode
4 months ago
Maëlick • 0

I had the same problem and then realized that I had forgotten the preprocessing! If you add these commands before running the PCA, it should work:

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Also, you don't need to add fill_value=np.nan; it will only create a non-sparse matrix that slows down the computation a lot.
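
Putting the suggestions in this thread together, a minimal end-to-end sketch (an assumption about the overall ordering, using the four objects from above; the exact tutorial steps may differ):

import scanpy as sc
import anndata as ad

adata = ad.concat([adata1, adata2, adata3, adata4], join="outer", fill_value=0)  # avoid the np.nan fill
adata.obs_names_make_unique()            # 10x barcodes can collide across runs
sc.pp.filter_genes(adata, min_cells=1)   # drop genes detected in no cell
sc.pp.normalize_total(adata)             # preprocessing, as suggested above
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
sc.tl.tsne(adata, use_rep="X_pca")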

ADD COMMENT
