Question: Result of scanpy tools rank_genes_groups
0
gravatar for el24
4 months ago by
el2410
USA
el2410 wrote:

Hi everyone, I'm clustering my single-cell data using Scanpy package, and I use rank_gene_groups to rank genes for characterizing groups. This returns names, scores, logfoldchanges, and pvals_adj for my Anndata format. I'm trying to interpret scores for this data, but I couldn't find any information except their source code. The values for scores are usually between about 1.2 to 39. This range doesn't make sense based on their definition, as below:

scores: structured `np.ndarray` (`.uns['rank_genes_groups']`)
    Structured array to be indexed by group id storing the z-score
    underlying the computation of a p-value for each gene for each
    group. Ordered according to scores.

A couple of lines from my result is shown as follows. 0-n means the 'name' of cluster 0, followed by 0_s as 'scores', 0-p as 'pvals_adj', and 0_l as 'logfoldchanges' for cluster 0. Afterward, we have the same data for other clusters. (I have 5 clusters, but not all are shown in the image)

Screen-Shot-2020-04-30-at-00-36-31

Could you please tell me what "scores" really mean in this result? Your help would be much appreciated! Thanks a lot in advance!

ADD COMMENTlink modified 18 days ago by zf.math20130 • written 4 months ago by el2410

What makes you think 1.2 to 39 doesn't make sense for z-score?

ADD REPLYlink written 4 months ago by shoujun.gu310

Hi @shoujun.gu

The reason is that they are all positive values so I was wondering how could it be possible?! unless these are absolute values for z score, but I am not sure.

ADD REPLYlink modified 4 months ago • written 4 months ago by el2410

The default setting of "tl.rank_genes_groups" is to return top 100 marker gene. I think all positive z-score is expected.

ADD REPLYlink written 4 months ago by shoujun.gu310

Yes, it's correct that the default is n_genes=100, but I'm using all available genes as n_genes=adata.to_df().shape[1] in my code.

ADD REPLYlink modified 4 months ago • written 4 months ago by el2410
1

In the source code, 'rank_genes_groups' uses stats.ttest_ind_from_stats() for default method, which returns t-statistics. You may double check the value of your adata.shape[1], or try using something like adata.raw.shape[1] or adata.X.toarray().shape[1].

ADD REPLYlink written 4 months ago by shoujun.gu310

Sorry for the confusion, I updated my response above as I forgot to write the to_df() function above but I had it in my code. I have used n_genes=adata.to_df().shape[1] so that part is fine and I could see the size of the output is as I expect. Although the other way also works for me - without using to_df()

ADD REPLYlink modified 4 months ago • written 4 months ago by el2410
0
gravatar for zf.math2013
18 days ago by
zf.math20130 wrote:

Hi @el24,

I have a question for you, I also would like to use scanpy for getting differntially expressed genes between to time-points data. But I do not know how to write my data in the AnnData format, if you know, could you please help me?

For each time-point, I have 200 cells and 10000 genes, and a gene expression matrix with this dimension. Let suppose A0 and A10 are the gene expression matrices for time-pints 0 and 10. c1-c200 are the cell names for time-point 0 and c201-c400 are the cell names for time-point 10.

I would be thankful if you help me to solve this issue.

ADD COMMENTlink written 18 days ago by zf.math20130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1639 users visited in the last hour