Question: Classifying the samples based on zscore of a specific gene
Biologist190 wrote:

I'm interested in checking the association of a gene with some clinical parameters. For that, I'm classifying the samples into high and low based on the zscore values of the gene GABRD. I have the fpkm data and calculated the zscore from it.

I took the cutoff Z=1 (very relaxed threshold)

So, zscore >=1 are classified as GABRD high. But I don't see any samples with zscore <= -1 to classify them into GABRD low.

Is it OK if I take zscore >=1 as high and zscore <=1 as low?

Thank you.

zscore rna-seq geneexpression R • 530 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by Biologist190

I think that something went wrong with the analysis. Could you provide a plot of your data? In R it can be made like this: plot(density(data)); you may remove the names and all the IDs. Having no samples with z-score < -1 is very, very suspicious. Most probably you should not use z-scores. You can use z-scores only if your random variable is distributed in a bell-shaped manner (see answer below). Your distribution is likely right-skewed (https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/skewed-distribution/ )

Here is the short answer (actually the absence of an answer) on why you should not use z-scores:

https://stats.stackexchange.com/questions/32357/can-i-use-a-z-score-with-skewed-and-non-normal-data
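To illustrate why a right-skewed variable can end up with no z-scores below -1, here is a minimal sketch on simulated data. The lognormal parameters and the names fpkm_sim and z_sim are assumptions made for the illustration, not the poster's data.

```r
# Simulated right-skewed, strictly positive "expression" values (assumption).
set.seed(1)
fpkm_sim <- rlnorm(600, meanlog = 0, sdlog = 1.5)

# Classic z-score: centre on the mean, scale by the standard deviation.
z_sim <- (fpkm_sim - mean(fpkm_sim)) / sd(fpkm_sim)

# All values are positive, so z can never drop below -mean/sd.
# When the sd exceeds the mean, that bound is above -1.
n_below <- sum(z_sim <= -1)
n_above <- sum(z_sim >= 1)

# Uncomment to inspect the shape:
# plot(density(z_sim))
```

In a case like this the long right tail inflates both the mean and the standard deviation, so the lower tail of the z-scores is squeezed towards zero.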

ADD REPLYlink modified 10 months ago • written 10 months ago by German.M.Demidov1.8k
Kevin Blighe63k wrote:

Ideally, you should be aiming for absolute Z = 1.96 as the cut-off. On a two-tailed distribution, this is equivalent to p = 0.05. This being said, you do not have to define the Z score cut-offs in terms of probabilities - just be aware that Z = 1 is not a statistically significantly heightened level, though.
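As a quick check of the numbers above, the two-tailed probabilities can be computed from the standard normal distribution with base R. The helper name p_two_sided is made up for this sketch, and the calculation assumes the z-scores really are approximately normal.

```r
# Two-tailed probability of seeing |Z| at least as large as z
# under a standard normal distribution.
p_two_sided <- function(z) 2 * pnorm(-abs(z))

p_196 <- p_two_sided(1.96)  # about 0.05
p_1   <- p_two_sided(1)     # about 0.32

# Fraction of values expected below -1 if the data are normal: about 0.16
frac_below_minus1 <- pnorm(-1)
```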

Here, this graphic is pretty neat: [image: standard normal distribution]

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

Also, defining 'low' as Z<=1 is somewhat misleading, as any Z-score greater than 0 is technically above the mean of your dataset, and thus has heightened expression.

Important to consider:

  • how have you pre-processed your data?
  • how have you calculated Z-scores? (by row?; by column?; ...just using the entire dataset?)

Kevin

ADD COMMENTlink modified 10 months ago • written 10 months ago by Kevin Blighe63k

I got the fpkm expression data from TCGA and converted it to zscores using the zFPKM function. So, now how should I classify? I don't have any zscore values above 1.96, and no zscore values below -1.96.

ADD REPLYlink written 10 months ago by Biologist190

have you log-transformed the fpkm values before applying the z-score?
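For a z-score computed by hand, a log transform before standardising could look like the sketch below. The simulated data, the pseudocount of 1, and the variable names are all assumptions for illustration.

```r
# Simulated right-skewed FPKM-like values (assumption, not real data).
set.seed(1)
fpkm_sim <- rlnorm(600, meanlog = 0, sdlog = 1.5)

# log2 with a pseudocount of 1 (an assumed choice) so that zeros stay finite.
log_fpkm <- log2(fpkm_sim + 1)

# scale() centres on the mean and divides by the standard deviation.
z_log <- as.numeric(scale(log_fpkm))

# Uncomment to compare the shapes before and after the transform:
# plot(density(fpkm_sim)); plot(density(z_log))
```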

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

No, I didn't log transform. I thought I had to. But in the zFPKM documentation there is a note saying that the data is not log2 transformed.

ADD REPLYlink written 10 months ago by Biologist190

try to apply zFPKMPlot from https://bioconductor.org/packages/release/bioc/vignettes/zFPKM/inst/doc/zFPKM.html and check if your distribution is right skewed. Not having z-scores less than -1 is, khm, totally unbelievable if the analysis was correct. As you can see from the plot above, around 15% of your values should be < -1.

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

I took the zscore of the GABRD gene, and the density plot looks like this.

ADD REPLYlink written 10 months ago by Biologist190

well. it looks bad. it is not right-skewed, though. you have two options: either you cut your left tail (these are probably technical artifacts, but you have to work that out yourself), or you use Qn (https://cran.r-project.org/web/packages/robustbase/robustbase.pdf) as a measure of standard deviation for your z-scores and the median as a measure of central tendency. something like z-score = (data - median(data)) / Qn(data)
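A minimal sketch of that robust z-score is below. The function name robust_z and the simulated data are assumptions; base R's mad() is used as the default scale so the code runs without extra packages, and robustbase::Qn is swapped in only if robustbase is installed.

```r
# Robust z-score: centre on the median, scale by a robust spread estimate.
# scale_fun defaults to stats::mad; pass robustbase::Qn to follow the comment above.
robust_z <- function(x, scale_fun = stats::mad) {
  (x - median(x)) / scale_fun(x)
}

# Simulated data with a small artefactual left tail (assumption).
set.seed(1)
x <- c(rnorm(95), rnorm(5, mean = -8))

z_mad <- robust_z(x)

# Qn version, only if the robustbase package is available.
if (requireNamespace("robustbase", quietly = TRUE)) {
  z_qn <- robust_z(x, scale_fun = robustbase::Qn)
}
```

Both scale estimates are far less affected by a handful of extreme values than sd() is, which is the point of the suggestion.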

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

Ok. I took the fpkm data of that gene and applied the below function to get the zscore.

# standardise a numeric vector: subtract the mean, divide by the standard deviation
zscore <- function(x) {
    z <- (x - mean(x)) / sd(x)
    return(z)
}

gabrd_z <- zscore(gabrd_fpkm)

And then made a density plot on gabrd_z. I see that it is right-skewed.

[image: density plot on zscore of GABRD]

But now, which cutoff should I use to classify the samples into high and low?

ADD REPLYlink modified 10 months ago • written 10 months ago by Biologist190

nah, it does not look that bad. the skewness may be neglected if you do it as a rough analysis. so, now you have REAL z-scores and this is good =) you may proceed with your analysis. The choice of cutoff will depend on what you are trying to say with these values (what does high mean for you? what does low mean for you from the biological perspective?)

ADD REPLYlink modified 10 months ago • written 10 months ago by German.M.Demidov1.8k

So, basically I wanted to classify around 600 samples into GABRD high and GABRD low groups and check the association with some clinical parameters. I want to use all these 600 samples for the analysis. But if I take +1.96 and -1.96 as cutoffs for high and low, I may be able to use only 50 samples for my analysis. So, I'm really confused: on what basis should I choose the cutoff?

Can I consider all the samples with positive values as high and negative as low?

ADD REPLYlink modified 10 months ago • written 10 months ago by Biologist190

I'd recommend you to use regression for the association and use this z-score as a continuous predictor. Here is a useful explanation: https://stats.stackexchange.com/questions/16565/what-is-the-effect-of-dichotomising-variables .

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

Small help, please. How can this dichotomisation be done on zscore values in R?

Can you please give an example?

ADD REPLYlink modified 10 months ago • written 10 months ago by Biologist190

no-no, dichotomization is what you're trying to do. dichotomization is basically the division of your variable into 2 groups (high or low expression of the gene). this procedure is not recommended in general. put your scores into the regression model and check the association without division into groups. Like, to predict the weight of people based on their height, you will not divide height into 2 groups (tall and short people), but use the raw value in centimeters instead.
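The height and weight analogy can be sketched on simulated data. All numbers below (means, slope, noise) are made-up assumptions, chosen only to show how much explained variance is lost when a continuous predictor is split at the median.

```r
# Simulated heights (cm) and weights (kg); every parameter is an assumption.
set.seed(1)
n <- 600
height <- rnorm(n, mean = 170, sd = 10)
weight <- 0.9 * height - 90 + rnorm(n, sd = 8)

# Model 1: height as a continuous predictor.
fit_cont <- lm(weight ~ height)

# Model 2: height dichotomised at the median into "short" and "tall".
tall <- factor(ifelse(height > median(height), "tall", "short"))
fit_bin <- lm(weight ~ tall)

# Compare the proportion of variance explained by each model.
r2_cont <- summary(fit_cont)$r.squared
r2_bin  <- summary(fit_bin)$r.squared
```

The same comparison applies with the GABRD z-score in place of height and a clinical variable in place of weight.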

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

Ok. sorry I misunderstood. But for my analysis division into groups is what I want.

ADD REPLYlink written 10 months ago by Biologist190

then try different thresholds (±1.96, ±1.28, ±1.04, etc - qnorm(some_round_number_from_0_to_1) ) - and choose the one that will give you significant p-value =)

no, seriously, then choose ±1.96, it sounds reasonable. You'll have only 50 samples - but that's what you want.

and drawing a scatterplot, plot(clinical_outcome ~ zscore), is always a good practice.
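For the classification itself, a minimal sketch is below. The simulated z-scores and the helper name classify are assumptions; the cutoffs are the ones listed above, recovered with qnorm.

```r
# Simulated z-scores for 600 samples (assumption, not real data).
set.seed(1)
z <- as.numeric(scale(rnorm(600)))

# Cutoffs from normal quantiles: roughly 1.96, 1.28 and 1.04.
cutoffs <- qnorm(c(0.975, 0.90, 0.85))

# Label samples beyond +/- cutoff; everything in between stays unclassified.
classify <- function(z, cutoff) {
  ifelse(z >= cutoff, "high", ifelse(z <= -cutoff, "low", "unclassified"))
}

groups <- classify(z, 1.96)

# Number of samples that would enter the analysis at each cutoff.
n_used <- sapply(cutoffs, function(k) sum(abs(z) >= k))
```

The stricter the cutoff, the fewer samples remain. Scanning cutoffs until a p-value turns significant is the joke in the comment above, not a recommendation.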

ADD REPLYlink modified 10 months ago • written 10 months ago by German.M.Demidov1.8k

Hey hi small help again. So, I have the data like below:

df:

Samples GABRD   Gender  Stage
Sample1 0.002   Female  A
Sample2 0.233   Female  A
Sample3 1.527   Female  B
Sample4 -3.45   Male    C
Sample5 0.79    Male    B
Sample6 2.19    Male    A
Sample7 0.42    Female  C
Sample8 -1.01   Male    A
Sample9 0.627   Female  B
Sample10 -0.23  Male    B

To check the relationship, is it fine to just use lm like below?

lm(GABRD ~ Gender + Stage, data = df)

Or do I have to check the relationship with Gender and Stage separately?

lm(GABRD ~ Gender, data = df)
lm(GABRD ~ Stage, data = df)
ADD REPLYlink written 10 months ago by Biologist190

Hi, in my opinion - only together. Maybe include an interaction term (Gender * Stage) in the model (you have enough samples, as I understood). Be sure to perform regression diagnostics. https://data.library.virginia.edu/diagnostic-plots/
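A sketch of both models, using the ten example rows posted above. Ten samples are far too few for a real interaction model, so this only shows the syntax; the object names fit_add and fit_int are made up for the example.

```r
# The ten example rows from the question above.
df <- data.frame(
  Samples = paste0("Sample", 1:10),
  GABRD   = c(0.002, 0.233, 1.527, -3.45, 0.79, 2.19, 0.42, -1.01, 0.627, -0.23),
  Gender  = factor(c("Female", "Female", "Female", "Male", "Male",
                     "Male", "Female", "Male", "Female", "Male")),
  Stage   = factor(c("A", "A", "B", "C", "B", "A", "C", "A", "B", "B"))
)

# Additive model: Gender and Stage together.
fit_add <- lm(GABRD ~ Gender + Stage, data = df)

# Model with interaction: Gender * Stage expands to Gender + Stage + Gender:Stage.
fit_int <- lm(GABRD ~ Gender * Stage, data = df)

# F-test of whether the interaction terms improve on the additive model.
anova(fit_add, fit_int)

# Regression diagnostics (residuals, Q-Q plot, leverage); uncomment to draw:
# par(mfrow = c(2, 2)); plot(fit_add)
```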

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

Or else, as I have the fpkm expression data of that gene, can I take the median as a cutoff and classify the samples into high and low?

ADD REPLYlink written 10 months ago by Biologist190

the plot that you've shown is not a plot of z-score. it is not centered around 0. yes, you can use median, as well as any other value to say if your genes are high or low. but what do you want to get from such classification?
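A median split could be sketched like this. The simulated vector gabrd_fpkm_sim is an assumption standing in for the real FPKM values.

```r
# Simulated FPKM-like values for 600 samples (assumption, not real data).
set.seed(1)
gabrd_fpkm_sim <- rlnorm(600, meanlog = 0, sdlog = 1.5)

# Samples above the median are "high", the rest are "low".
gabrd_group <- factor(
  ifelse(gabrd_fpkm_sim > median(gabrd_fpkm_sim), "high", "low"),
  levels = c("low", "high")
)

# Count the samples in each group.
group_counts <- table(gabrd_group)
```

Because a median split depends only on the rank order of the samples, it gives the same two groups whether it is applied to FPKM, log-transformed FPKM, or the z-scores. It keeps all 600 samples, but the concerns about dichotomising raised earlier in the thread still apply.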

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k

Your given figure is the standard population distribution. And the actual question is the iid sample distribution. I think the point estimation should be done first to estimate the u and delta of the population distribution based on the observed data.

ADD REPLYlink written 10 months ago by shoujun.gu310

sigma, not delta =)

ADD REPLYlink written 10 months ago by German.M.Demidov1.8k