Question: t-test or linear regression?
moushengxu wrote:

This is a generic question that does not have to be related to biology.

Given a numeric score y that looks normally distributed, and a categorical variable x with two categories, 0 and 1: to test whether y differs between categories 0 and 1, I understand you can just do a t-test and get your p-value. The question is, what if I want to use linear regression instead, e.g. glm(y ~ x)? I assumed the t-test and the glm would return the same p-value, but they didn't, even though I thought they were equivalent.

Thanks!


russhh: Could you post the code that you used, please? Also, could you state whether your experiment is balanced, that is, whether there is the same number of samples in category 0 as in category 1?


moushengxu: The code is simple:

d <- read.delim("mydata.txt")
d1 <- d$score[d$category == 0]
d2 <- d$score[d$category == 1]

# t.test
t.test(d1, d2, var.equal = TRUE)

# glm
summary(glm(score ~ category, data = d))

With "var.equal=F", t.test & glm gave different p-values. With "var.equal=T" they yielded the same p-value.

My experiment is not balanced and not paired.


Giovanni M Dall'Olio: If the y variable were only 0|1, it would be more appropriate to do a logistic regression, e.g. summary(glm(y ~ x, family = 'binomial')). This also gives you an odds ratio: an estimate of how much an increase in x corresponds to higher or lower odds of getting y == 1 (exponentiate the coefficient to obtain the odds ratio).

In general I think the advantages of using a regression over a t-test are two: 1) you get an effect estimate (an odds ratio, in the logistic case) in addition to a p-value; 2) you can easily add more factors if there are other variables. A minimal sketch of the logistic case follows below.
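For illustration, a minimal sketch of the logistic-regression case; y01 and x here are simulated for the example, not taken from the original question:

# Logistic regression sketch: binary 0/1 response, numeric predictor.
# y01 and x are simulated purely for illustration.
set.seed(1)
x   <- rnorm(100)                                  # numeric predictor
y01 <- rbinom(100, size = 1, prob = plogis(x))     # binary 0/1 response
fit <- glm(y01 ~ x, family = binomial)
summary(fit)                 # coefficient table with p-values
exp(coef(fit)["x"])          # odds ratio per unit increase in x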


moushengxu: Yeah, good points, thanks.

russhh wrote:

Just so you are aware: an assumption of ANOVA is that the standard deviations are identical in each category. The t-test used by R does not, by default, assume identical standard deviations in the two categories (it performs a Welch t-test), although in textbooks equal variances is a common assumption. By setting var.equal = TRUE in the t.test call, you can recover the same p-value as obtained from an ANOVA implementation.

# Example using unbalanced data
set.seed(1); library(magrittr)
x <- c(rep('a', 15), rep('b', 5)) %>% factor
y <- rnorm(20)

# t-test with default settings: i.e., equal sd for each group is not assumed (var.equal = FALSE)
t.test(formula = y ~ x)    

#         Welch Two Sample t-test
# 
# data:  y by x
# t = -1.0707, df = 15.609, p-value = 0.3006
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -1.0704444  0.3529962
# sample estimates:
# mean in group a mean in group b 
#       0.1008428       0.4595670 

## t-test assuming sds are identical
t.test(formula = y ~ x, var.equal = TRUE)    

#         Two Sample t-test
# 
# data:  y by x
# t = -0.7519, df = 18, p-value = 0.4618  ## <<<<--- p-values differ between the two t-tests
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -1.3610547  0.6436064
# sample estimates:
# mean in group a mean in group b 
#       0.1008428       0.4595670 

## ANOVA (via lm)
lm(y ~ x) %>% summary

# Call:
# lm(formula = y ~ x)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.3155 -0.5589  0.1815  0.4773  1.4944 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.1008     0.2385   0.423    0.677
# xb            0.3587     0.4771   0.752    0.462
# 
# Residual standard error: 0.9239 on 18 degrees of freedom
# Multiple R-squared:  0.03045,   Adjusted R-squared:  -0.02341 
# F-statistic: 0.5654 on 1 and 18 DF,  p-value: 0.4618  ## <<<-- p-value matches the equal-variance t-test

So, although textbooks may tell you that the t-test is equivalent to a one-way, two-group ANOVA, that is only true if you assume the variances are equal in the two groups. In particular, whenever you use a statistical test in a computational package, it's valuable to know which implementation of the test you are actually using.
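As a quick programmatic check (reusing x and y from the code above), the equal-variance t-test and the lm coefficient test return numerically identical p-values:

# Equal-variance t-test p-value vs. lm coefficient p-value,
# using x and y from the example above ("xb" is the coefficient row).
p_ttest <- t.test(y ~ x, var.equal = TRUE)$p.value
p_lm    <- summary(lm(y ~ x))$coefficients["xb", "Pr(>|t|)"]
all.equal(p_ttest, p_lm)   # TRUE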

PS: I think this question fits happily on Biostars.


moushengxu: Thanks so much! This is the answer I was looking for!

Devon Ryan wrote:

You should have used lm(y ~ x), which will produce similar results. A plain t-test can make different assumptions, regarding things like the variance within groups. There are also other implementation differences under the hood, so it's unsurprising that you got slightly different results. For cases where a t-test is appropriate, it's best to use it.
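As a minimal check, assuming x and y from russhh's example above: glm with its default gaussian family fits the same model as lm, so the coefficient tables agree.

# glm with the default gaussian family is the same model as lm,
# so estimates, standard errors and p-values are identical.
coef(summary(glm(y ~ x)))   # gaussian glm
coef(summary(lm(y ~ x)))    # same coefficient table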

In the future, this would be more appropriate for Cross Validated.


moushengxu: Thanks. lm and glm give the same results here.

See russhh's reply for the cause of the difference between glm and t.test.
