Question: Classifying the samples based on zscore of a specific gene
0
Biologist190 wrote:

I'm interested in checking the association of a gene with some clinical parameters. To do that, I'm classifying the samples into high and low groups based on the z-score of the gene `GABRD`. I have the FPKM data and have calculated the z-scores from it.

I took the cutoff Z=1 (very relaxed threshold)

So, samples with z-score >= 1 are classified as `GABRD high`. But I don't see any samples with z-score <= -1 to classify as `GABRD low`.

Is it OK if I take z-score >= 1 as `high` and z-score <= 1 as `low`?

Thank you.

zscore rna-seq geneexpression R • 530 views
modified 10 months ago • written 10 months ago by Biologist190
1

I think that something went wrong with the analysis. Could you provide a plot of your data? In R it can be made with `plot(density(data))`; you may remove the names and all the IDs. Having no samples with z-score < -1 is very, very suspicious. Most probably you should not use z-scores. You can use z-scores only if your random variable is distributed in a roughly bell-shaped manner (see the answer below). Your distribution is likely right-skewed (https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/skewed-distribution/).

That's the short answer (actually, the absence of an answer) to why you should not use z-scores.

4
Kevin Blighe63k wrote:

Ideally, you should be aiming for absolute Z = 1.96 as the cut-off. On a two-tailed distribution, this is equivalent to p = 0.05. This being said, you do not have to define the Z score cut-offs in terms of probabilities - just be aware that Z = 1 is not a statistically significantly heightened level, though.
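To make the link between the Z cut-off and the two-tailed probability concrete, here is a small sketch using base R's `pnorm` and `qnorm`. The helper name `two_tailed_p` is made up for illustration, and everything assumes a standard normal distribution.

```r
# Two-tailed tail probability for a given |Z| under a standard normal
two_tailed_p <- function(z) 2 * pnorm(-abs(z))

p_196 <- two_tailed_p(1.96)  # close to 0.05
p_1   <- two_tailed_p(1)     # roughly 0.32, far from a "significant" level

# Going the other way: the |Z| cut-off for a chosen two-tailed probability
z_cut <- qnorm(1 - 0.05 / 2)

print(c(p_196 = p_196, p_1 = p_1, z_cut = z_cut))
```

These numbers only describe a normal distribution; they say nothing about whether the expression values themselves are normally distributed.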

Here, this graphic is pretty neat: [source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

Also, defining '`low`' as Z<=1 is somewhat misleading, as any Z-score greater than 0 is technically above the mean of your dataset, and thus reflects heightened expression.

Important to consider:

• how have you pre-processed your data?
• how have you calculated Z-scores? (by row?; by column?; ...just using the entire dataset?)

Kevin

I got the FPKM expression data from TCGA and used the `zFPKM` function to convert them to z-scores. So how should I classify now? I don't have any z-score values above 1.96 or below -1.96.

Did you log-transform the FPKM values before applying the z-score?

No, I didn't log-transform. I thought I had to, but the `zFPKM` documentation has a note saying that the input data are not log2-transformed.

Try `zFPKMPlot` from https://bioconductor.org/packages/release/bioc/vignettes/zFPKM/inst/doc/zFPKM.html and check whether your distribution is right-skewed. Having no z-scores below -1 is, frankly, unbelievable if the analysis was correct. As you can see from the plot above, about 16% of values should be < -1 under a normal distribution.

I took the z-scores of the `GABRD` gene, and the density plot looks like this.
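As a rough check of that figure, and of why a skewed distribution can produce no z-scores below -1 at all, here is a sketch on simulated values. The data are hypothetical (log-normal draws standing in for expression values), not the poster's data, and the `log2(x + 1)` transform is shown only for comparison.

```r
# Under a standard normal, the expected share of z-scores below -1
frac_below <- pnorm(-1)   # about 0.159

# Hypothetical, strongly right-skewed "expression-like" values (NOT real data)
set.seed(1)
x <- rlnorm(600, meanlog = 0, sdlog = 1.5)

# Plain z-scores on the raw scale and on the log2(x + 1) scale
z_raw <- (x - mean(x)) / sd(x)
lx    <- log2(x + 1)
z_log <- (lx - mean(lx)) / sd(lx)

# For non-negative data the smallest possible z-score is -mean/sd,
# so values below -1 cannot occur whenever sd exceeds the mean.
lower_bound <- -mean(x) / sd(x)

print(c(expected_normal = frac_below,
        observed_raw    = mean(z_raw < -1),
        observed_log    = mean(z_log < -1),
        lower_bound     = lower_bound))
```

If the real data behave like this, the missing lower tail would be a property of the raw scale rather than a sign that the `high` group is unusually large.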

2

Well, it looks bad. It is not right-skewed, but you have two options. Either you cut off the left tail (these values are probably technical artifacts, but you will have to work that out yourself), or you use Qn (https://cran.r-project.org/web/packages/robustbase/robustbase.pdf) as the measure of spread for your z-scores and the median as the measure of central tendency, something like `z-score = (data - median(data)) / Qn(data)`.
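A minimal sketch of that robust z-score follows. It assumes the `robustbase` package is installed for `Qn`; the fallback to `stats::mad` when it is missing is an assumption of this sketch, not something suggested in the thread. The helper `robust_z` and the simulated data are hypothetical.

```r
# Robust z-score: centre on the median, scale by a robust spread estimate
robust_z <- function(x, scale_fn) {
  (x - median(x)) / scale_fn(x)
}

# Use robustbase::Qn when available; otherwise fall back to stats::mad
# (the fallback is an assumption of this sketch, not from the thread)
scale_fn <- if (requireNamespace("robustbase", quietly = TRUE)) {
  robustbase::Qn
} else {
  stats::mad
}

# Hypothetical data: a bell-shaped bulk plus a small artefactual left tail
set.seed(42)
x <- c(rnorm(580, mean = 5, sd = 1), rnorm(20, mean = -2, sd = 0.5))

z_classic <- (x - mean(x)) / sd(x)
z_robust  <- robust_z(x, scale_fn)

# The left tail inflates sd(), which compresses the classic z-scores
print(c(sd = sd(x), robust_scale = scale_fn(x)))
print(summary(z_robust))
```

With a robust centre and scale, the bulk of the samples keeps a sensible spread of z-scores even when a few extreme values are present.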

1

OK. I took the FPKM data of that gene and applied the function below to get the z-scores.

``````
# Standard z-score: centre on the mean, scale by the standard deviation
zscore <- function(x) {
  (x - mean(x)) / sd(x)
}

# gabrd_fpkm: numeric vector of GABRD FPKM values, one per sample
gabrd_z <- zscore(gabrd_fpkm)
``````

Then I made a density plot of `gabrd_z`. I see that it is right-skewed. But which cutoff should I now use to classify the samples into high and low?

1

Nah, it does not look that bad. The skewness may be neglected for a rough analysis. So now you have REAL z-scores, and this is good =) You may proceed with your analysis. The choice of cutoff will depend on what you are trying to say with these values (what does high mean for you, and what does low mean, from the biological perspective?)

So, basically I want to classify around 600 samples into GABRD high and GABRD low groups and check the association with some clinical parameters. I want to use all 600 samples for the analysis. But if I take +1.96 and -1.96 as the cutoffs for high and low, I may be able to use only 50 samples. So I'm really confused about the basis on which I should choose the cutoff.

Can I consider all the samples with positive values as high and negative as low?

I'd recommend using regression for the association, with the z-score as a continuous predictor. Here is a useful explanation: https://stats.stackexchange.com/questions/16565/what-is-the-effect-of-dichotomising-variables .

A small request, please: how can this dichotomisation be done on z-score values in R?

Can you please give an example?

1

No, no, dichotomization is what you're trying to do. Dichotomization is basically dividing your variable into 2 groups (high or low expression). This procedure is not recommended in general. Put your scores into the regression model and check the association without division into groups. For example, to predict people's weight from their height, you would not divide height into 2 groups (tall and short people), but use the raw value in centimeters instead.

OK, sorry, I misunderstood. But for my analysis, division into groups is what I want.
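Here is a sketch of what is lost by splitting, on simulated data only. The variables `z` and `outcome` are hypothetical stand-ins for the gene z-score and a continuous clinical measurement, and the linear relationship is an assumption built into the simulation.

```r
# Hypothetical data (NOT from the thread): a z-scored predictor and a
# continuous clinical outcome that depends on it linearly plus noise
set.seed(7)
n <- 600
z <- rnorm(n)
outcome <- 2 + 0.5 * z + rnorm(n)

# Model 1: z-score used as a continuous predictor
fit_cont <- lm(outcome ~ z)

# Model 2: the same z-score dichotomised at zero
group <- factor(ifelse(z > 0, "high", "low"), levels = c("low", "high"))
fit_dich <- lm(outcome ~ group)

# Share of outcome variance explained by each model
r2_cont <- summary(fit_cont)$r.squared
r2_dich <- summary(fit_dich)$r.squared
print(c(continuous = r2_cont, dichotomised = r2_dich))
```

In this simulation the dichotomised model explains less of the variance than the continuous one, which is the information loss the linked explanation describes.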

1

Then try different thresholds (±1.96, ±1.28, ±1.04, etc., i.e. `qnorm(p)` for some round p between 0 and 1) and choose the one that will give you a significant p-value =)

No, seriously: then choose ±1.96; it sounds reasonable. You'll have only 50 samples, but that's what you want.

And drawing a scatterplot, `plot(clinical_outcome ~ z_score)`, is always good practice.
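For the grouping itself, a sketch along these lines might do. The helper `classify_by_z` is hypothetical, the z-scores are simulated, and samples between the two cut-offs are left as `NA` (that is, dropped from the high/low comparison). Picking whichever threshold gives a significant result was a joke above and is not what this is for; the threshold should be fixed before looking at the clinical association.

```r
# Label samples as "high"/"low" at a symmetric z-score threshold;
# everything in between becomes NA and is excluded from the comparison
classify_by_z <- function(z, threshold) {
  grp <- ifelse(z >= threshold, "high",
                ifelse(z <= -threshold, "low", NA))
  factor(grp, levels = c("low", "high"))
}

# Hypothetical z-scores for 600 samples (NOT real data)
set.seed(3)
z <- rnorm(600)

# The thresholds mentioned above, recovered from normal quantiles
thresholds <- qnorm(c(0.975, 0.90, 0.85))   # about 1.96, 1.28, 1.04

# How many samples survive at each threshold
n_kept <- sapply(thresholds, function(t) sum(!is.na(classify_by_z(z, t))))
names(n_kept) <- round(thresholds, 2)
print(n_kept)

print(table(classify_by_z(z, thresholds[1]), useNA = "ifany"))
```

The stricter the threshold, the fewer samples remain, which is the trade-off discussed above.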

Hi, a small request again. I have data like the below:

``````
df:

Samples   GABRD   Gender  Stage
Sample1    0.002  Female  A
Sample2    0.233  Female  A
Sample3    1.527  Female  B
Sample4   -3.45   Male    C
Sample5    0.79   Male    B
Sample6    2.19   Male    A
Sample7    0.42   Female  C
Sample8   -1.01   Male    A
Sample9    0.627  Female  B
Sample10  -0.23   Male    B
``````

To check the relationship, is just using `lm` like below fine?

``````
lm(GABRD ~ Gender + Stage, data = df)
``````

Or do I have to check the relationship with Gender and Stage separately?

``````
lm(GABRD ~ Gender, data = df)
lm(GABRD ~ Stage, data = df)
``````
1

Hi, in my opinion, only together. Maybe include an interaction term (`Gender * Stage`) in the model (you have enough samples, as I understood). Be sure to perform regression diagnostics: https://data.library.virginia.edu/diagnostic-plots/

Alternatively, as I have the FPKM expression data of that gene, can I take the median as a cutoff and classify the samples into high and low?
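A sketch of the joint model with and without the interaction, using only the ten example rows posted above. Ten rows are far too few to trust the interaction fit, so this only shows the mechanics; the real data set would have around 600 samples.

```r
# The ten example rows from the question, typed in by hand
df <- data.frame(
  Samples = paste0("Sample", 1:10),
  GABRD   = c(0.002, 0.233, 1.527, -3.45, 0.79, 2.19, 0.42, -1.01, 0.627, -0.23),
  Gender  = factor(c("Female", "Female", "Female", "Male", "Male",
                     "Male", "Female", "Male", "Female", "Male")),
  Stage   = factor(c("A", "A", "B", "C", "B", "A", "C", "A", "B", "B"))
)

# Additive model: both covariates together
fit_add <- lm(GABRD ~ Gender + Stage, data = df)

# Model with the Gender-by-Stage interaction
fit_int <- lm(GABRD ~ Gender * Stage, data = df)

# F-test for whether the interaction terms improve the fit
print(anova(fit_add, fit_int))

# Standard diagnostic plots (uncomment in an interactive session)
# par(mfrow = c(2, 2)); plot(fit_add)
```

`Gender * Stage` expands to both main effects plus their interaction, so `fit_add` is nested within `fit_int` and the `anova` comparison is valid.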

The plot that you've shown is not a plot of z-scores: it is not centered around 0. Yes, you can use the median, as well as any other value, to say whether expression is high or low in a sample. But what do you want to get from such a classification?

The figure you gave is the standard population distribution, whereas the actual question concerns the distribution of an i.i.d. sample. I think point estimation should be done first, to estimate the μ and σ of the population distribution from the observed data.
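For reference, a median split could be sketched as below. The FPKM values are simulated and hypothetical. One thing worth noting is that a median split depends only on the ranks, so it gives the same groups on the raw and on the `log2(x + 1)` scale, and z-scoring first would not change it either.

```r
# Hypothetical FPKM values for 600 samples (NOT real data)
set.seed(11)
fpkm <- rlnorm(600, meanlog = 1, sdlog = 1)

# Median split on the raw FPKM scale
cutoff <- median(fpkm)
group  <- factor(ifelse(fpkm > cutoff, "high", "low"),
                 levels = c("low", "high"))
print(table(group))

# The same split after a monotone transform gives identical groups
lfpkm     <- log2(fpkm + 1)
group_log <- factor(ifelse(lfpkm > median(lfpkm), "high", "low"),
                    levels = c("low", "high"))
print(identical(group, group_log))
```

This keeps all samples in the analysis, at the price of treating samples just above and just below the median as different.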