Question

Result of scanpy tools rank_genes_groups

0

Entering edit mode

4.0 years ago

el24 ▴ 40

Hi everyone, I'm clustering my single-cell data using Scanpy package, and I use rank_gene_groups to rank genes for characterizing groups. This returns names, scores, logfoldchanges, and pvals_adj for my Anndata format. I'm trying to interpret scores for this data, but I couldn't find any information except their source code. The values for scores are usually between about 1.2 to 39. This range doesn't make sense based on their definition, as below:

scores: structured `np.ndarray` (`.uns['rank_genes_groups']`)
    Structured array to be indexed by group id storing the z-score
    underlying the computation of a p-value for each gene for each
    group. Ordered according to scores.

A couple of lines from my result is shown as follows. 0-n means the 'name' of cluster 0, followed by 0_s as 'scores', 0-p as 'pvals_adj', and 0_l as 'logfoldchanges' for cluster 0. Afterward, we have the same data for other clusters. (I have 5 clusters, but not all are shown in the image)

Could you please tell me what "scores" really mean in this result? Your help would be much appreciated! Thanks a lot in advance!

gene scanpy seurat rank_gene_groups clustering • 6.4k views

ADD COMMENT • link updated 3.6 years ago by Raz ▴ 10 • written 4.0 years ago by el24 ▴ 40

0

Entering edit mode

What makes you think 1.2 to 39 doesn't make sense for z-score?

ADD REPLY • link 4.0 years ago by shoujun.gu ▴ 350

0

Entering edit mode

Hi @shoujun.gu

The reason is that they are all positive values so I was wondering how could it be possible?! unless these are absolute values for z score, but I am not sure.

ADD REPLY • link 4.0 years ago by el24 ▴ 40

0

Entering edit mode

The default setting of "tl.rank_genes_groups" is to return top 100 marker gene. I think all positive z-score is expected.

ADD REPLY • link 4.0 years ago by shoujun.gu ▴ 350

0

Entering edit mode

Yes, it's correct that the default is n_genes=100, but I'm using all available genes as n_genes=adata.to_df().shape[1] in my code.

ADD REPLY • link 3.9 years ago by el24 ▴ 40

1

Entering edit mode

In the source code, 'rank_genes_groups' uses stats.ttest_ind_from_stats() for default method, which returns t-statistics. You may double check the value of your adata.shape[1], or try using something like adata.raw.shape[1] or adata.X.toarray().shape[1].

ADD REPLY • link 4.0 years ago by shoujun.gu ▴ 350

0

Entering edit mode

Sorry for the confusion, I updated my response above as I forgot to write the to_df() function above but I had it in my code. I have used n_genes=adata.to_df().shape[1] so that part is fine and I could see the size of the output is as I expect. Although the other way also works for me - without using to_df()

ADD REPLY • link 3.9 years ago by el24 ▴ 40

score 0 · Answer 1 · 2020-09-05

Hi @el24,

I have a question for you, I also would like to use scanpy for getting differntially expressed genes between to time-points data. But I do not know how to write my data in the AnnData format, if you know, could you please help me?

For each time-point, I have 200 cells and 10000 genes, and a gene expression matrix with this dimension. Let suppose A0 and A10 are the gene expression matrices for time-pints 0 and 10. c1-c200 are the cell names for time-point 0 and c201-c400 are the cell names for time-point 10.

I would be thankful if you help me to solve this issue.