Regression assumptions - fitted and residual plots
2.9 years ago
Elizabeth • 0

I have fitted a linear regression with several covariates and am now checking the assumptions. I have created a plot of the fitted/predicted values against the residuals. However, the data points do not appear to be randomly dispersed. Instead, some points appear to form diagonal lines (see image below).

Please could someone explain what might be going on here and whether regression assumptions may have been violated?

Thank you!

[Image: scatter plot of fitted/predicted values (x-axis) against standardised residuals (y-axis); the points are not evenly dispersed, and some form neat diagonal lines]

assumptions regression

You need to produce other diagnostic plots. Use the command plot(your_linear_model), i.e. call plot() on the object returned by lm(specification, data); four diagnostic plots will be produced.
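
For example, a minimal sketch (the formula, variables and data frame df below are hypothetical; substitute your own):

fit <- lm(y ~ x1 + x2, data = df)  # your fitted model
par(mfrow = c(2, 2))               # arrange the four plots in a 2x2 grid
plot(fit)                          # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage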


Show your input data. Are they integers instead of continuous numbers?


Thanks for your reply. The input data are the average (mean) of three integers. For example, a participant could have three scores of 5, 5 and 6; the mean of those three scores is that participant's input value.


So do you have many repeated values across your observations? And is this for the Y variable or the X variable, or are both variables this type of data? I think the plot could be related to having repeated, non-continuous values, because you have a low number of possible values: the residual is the observed value minus the fitted value, so all observations sharing the same observed value fall on their own line, which produces the parallel diagonals. For example, I can generate a similar plot by fitting integers:

set.seed(1)                                          # for a reproducible example
data1 <- as.integer(rnorm(300, mean = 50, sd = 10))  # discretised outcome
data2 <- as.integer(-2 * data1 + rnorm(300, 0, 5))   # discretised predictor
fit <- lm(data1 ~ data2)
fit.stdres <- rstandard(fit)                         # standardised residuals

plot(data2, fit.stdres, xlab = "x", ylab = "std. res")
abline(0, 0, col = "red")                            # reference line at zero

EDIT: I just found a discussion on a similar issue here.


Thanks for the link and help above - this was really helpful!

To answer your questions, I have (up to) three repeated values for each observation, but I have taken their mean, so there is one value per observation. This is for the outcome variable. The predictor and covariates have only one value per observation; no mean was created.

I think you are right that the parallel lines on the plot reflect the fact that the outcome variable is non-continuous and can take only a low number of possible values.

In a textbook (Andy Field, Discovering Statistics) I have seen that you can do a robust regression if assumptions are violated, that is, a regression with bootstrapping. Would a robust regression (i.e. bootstrapping) be useful here?


I have little experience with the type of data you describe; maybe others can chime in.
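
For reference, bootstrapping the coefficients of a linear model can be sketched in R with the boot package (a minimal illustration, not your exact model; the data frame df and variables y, x1, x2 are hypothetical):

library(boot)

# statistic: refit the model on a resampled data set and return its coefficients
boot_coefs <- function(data, indices) {
  coef(lm(y ~ x1 + x2, data = data[indices, ]))
}

set.seed(1)
boot_out <- boot(df, statistic = boot_coefs, R = 2000)

# percentile confidence interval for the coefficient of x1 (index 2)
boot.ci(boot_out, type = "perc", index = 2)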

2.9 years ago
Student ▴ 30

The assumptions of a linear regression model are that the residuals follow a Gaussian (normal) distribution and that they have equal variance (homoscedasticity).

Therefore, you have to check the homogeneity of variance of the residuals. In R, you can plot the residuals as a function of the fitted values (as you have done, I think): first check by eye whether the spread of the residuals depends on the fitted values, and then you could also run Levene's test (a specific test for homogeneity of variance).

Finally, for the other assumption, you can draw a quantile-quantile (Q-Q) plot, that is, a plot of the empirical quantiles of the residuals against the theoretical quantiles of a Gaussian distribution. If the residuals are normally distributed, the points of such a plot will lie on the diagonal (the line y = x).
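
In R, both checks could look like the following sketch (assuming your fitted model object is called fit; Levene's test needs groups, so binning the fitted values is one common workaround, and the car package is assumed to be installed):

library(car)  # provides leveneTest()

# Q-Q plot of the standardised residuals against a Gaussian reference
qqnorm(rstandard(fit))
qqline(rstandard(fit), col = "red")

# Levene's test: bin the fitted values into three groups and compare spread
groups <- cut(fitted(fit), breaks = 3)
leveneTest(residuals(fit), group = groups)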

You can obtain both plots by calling plot(your_linear_model).

In my opinion, all these things are well described here. In the case of your plot, it seems that the residuals have some dependence on the fitted values, so the first assumption I mentioned could be violated. But I suggest running plot(your_linear_model), as said before and as the others also suggest.


Thanks Manuela. Is there a solution if the assumption of homogeneity of variance is violated for regression?

I have seen one textbook recommending the use of bootstrapping when assumptions are violated. I have also seen a paper that reports calculating "heteroscedasticity consistent standard errors".


To be honest, I also do not have much experience with these cases. This website gives a little hint. As for the paper, I have not studied these things, but on Wikipedia you can find a list of software packages for "consistent estimation of the covariance matrix of the coefficient estimates in regression models" in the case of heteroscedasticity. But I really do not know more, I am sorry :/
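
In R, for instance, heteroscedasticity-consistent standard errors can be sketched with the sandwich and lmtest packages (assuming your fitted model object is called fit):

library(sandwich)  # heteroscedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for re-testing coefficients

# re-test the coefficients using HC3 robust standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))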


You may transform your response: try log-transforming Y or taking its square root.
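
A minimal sketch (the data frame df and variables y, x1, x2 are hypothetical; log() requires strictly positive Y):

fit_log  <- lm(log(y)  ~ x1 + x2, data = df)  # log transform (Y must be > 0)
fit_sqrt <- lm(sqrt(y) ~ x1 + x2, data = df)  # square-root transform
plot(fit_log)  # re-check the diagnostic plots after transforming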


Yes, because, as it says:

In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth. Some combination of logging and/or deflating will often stabilize the variance in this case.

But I think the website is clear and covers everything of interest for this post :)
