Result of scanpy tools rank_genes_groups
1
0
Entering edit mode
12 months ago
el24 ▴ 20

Hi everyone, I'm clustering my single-cell data using Scanpy package, and I use rank_gene_groups to rank genes for characterizing groups. This returns names, scores, logfoldchanges, and pvals_adj for my Anndata format. I'm trying to interpret scores for this data, but I couldn't find any information except their source code. The values for scores are usually between about 1.2 to 39. This range doesn't make sense based on their definition, as below:

scores: structured `np.ndarray` (`.uns['rank_genes_groups']`)
    Structured array to be indexed by group id storing the z-score
    underlying the computation of a p-value for each gene for each
    group. Ordered according to scores.

A couple of lines from my result is shown as follows. 0-n means the 'name' of cluster 0, followed by 0_s as 'scores', 0-p as 'pvals_adj', and 0_l as 'logfoldchanges' for cluster 0. Afterward, we have the same data for other clusters. (I have 5 clusters, but not all are shown in the image)

Screen-Shot-2020-04-30-at-00-36-31

Could you please tell me what "scores" really mean in this result? Your help would be much appreciated! Thanks a lot in advance!

gene scanpy seurat rank_gene_groups clustering • 1.1k views
ADD COMMENT
0
Entering edit mode

What makes you think 1.2 to 39 doesn't make sense for z-score?

ADD REPLY
0
Entering edit mode

Hi @shoujun.gu

The reason is that they are all positive values so I was wondering how could it be possible?! unless these are absolute values for z score, but I am not sure.

ADD REPLY
0
Entering edit mode

The default setting of "tl.rank_genes_groups" is to return top 100 marker gene. I think all positive z-score is expected.

ADD REPLY
0
Entering edit mode

Yes, it's correct that the default is n_genes=100, but I'm using all available genes as n_genes=adata.to_df().shape[1] in my code.

ADD REPLY
1
Entering edit mode

In the source code, 'rank_genes_groups' uses stats.ttest_ind_from_stats() for default method, which returns t-statistics. You may double check the value of your adata.shape[1], or try using something like adata.raw.shape[1] or adata.X.toarray().shape[1].

ADD REPLY
0
Entering edit mode

Sorry for the confusion, I updated my response above as I forgot to write the to_df() function above but I had it in my code. I have used n_genes=adata.to_df().shape[1] so that part is fine and I could see the size of the output is as I expect. Although the other way also works for me - without using to_df()

ADD REPLY
0
Entering edit mode
8 months ago

Hi @el24,

I have a question for you, I also would like to use scanpy for getting differntially expressed genes between to time-points data. But I do not know how to write my data in the AnnData format, if you know, could you please help me?

For each time-point, I have 200 cells and 10000 genes, and a gene expression matrix with this dimension. Let suppose A0 and A10 are the gene expression matrices for time-pints 0 and 10. c1-c200 are the cell names for time-point 0 and c201-c400 are the cell names for time-point 10.

I would be thankful if you help me to solve this issue.

ADD COMMENT

Login before adding your answer.

Traffic: 2899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6