FAQ: How should I interpret a residual plot?

What are you looking for?

- Extreme points (outliers)
- Uneven variance (heteroscedasticity)
- Systematic trends, anything non-random

Outliers

Remember that if you have random, independent, normally-distributed residuals (the ideal), you EXPECT some relatively extreme observations.
- With N=100, roughly how many observations would you expect to fall outside of 2SD from the mean?
- 2SD is a handy tip, not a definition. It’s a tool, to give you one option for a place to start your outlier analysis.
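To put a number on the question above, here is a minimal sketch that computes the expected count directly from the normal distribution. It assumes the residuals really are independent and normally distributed; the function name `expected_beyond` is made up for this illustration.

```python
import math

def expected_beyond(n, k):
    """Expected number of n independent normal observations more than k SD from the mean."""
    # For a standard normal Z, P(|Z| > k) equals erfc(k / sqrt(2)).
    tail_prob = math.erfc(k / math.sqrt(2))
    return n * tail_prob

for k in (2, 3):
    print(f"N=100, beyond {k} SD: about {expected_beyond(100, k):.2f} observations")
# N=100, beyond 2 SD: about 4.55 observations
# N=100, beyond 3 SD: about 0.27 observations
```

So with N=100 and ideal residuals, four or five points beyond 2SD is what you should expect to see, not a sign of trouble by itself.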
Before you remove outliers, ask yourself:
- Are these points extreme relative to the rest of the observations?
- Are these points influencing my model, such that the regression line doesn’t fit the bulk of the data as well as it could?
- Are these points just extreme observations of the same basic effect, or do they seem to represent a different underlying process altogether? In effect, can these observations be considered to inform the model conceptually?
The rationale for removing outliers is one (or both) of the following:
- These points are not being driven by the same effect(s) as the other points in the model; there’s clearly a totally different process going on here (e.g. early adversity generally predicts negative outcomes, but some children show resilience – there’s something special about those cases that makes the model not work the same way).
- These points are so extreme and influential that they are pulling the regression line away from a good fit for the rest of the data; if I leave these points in here, I’m sacrificing prediction accuracy for the bulk of my observations.
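One rough way to check the second rationale is to fit the model with and without the suspect point and see how much the line moves. The sketch below uses made-up data (a perfect line plus one extreme point) and a hand-rolled `fit_line` helper, both of which are assumptions for illustration rather than part of any particular analysis.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: y = 2x exactly, except for one extreme point at the end.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2 * x for x in xs]
ys[-1] = 60  # the suspected outlier (it would be 20 if it followed the trend)

_, slope_all = fit_line(xs, ys)
_, slope_trimmed = fit_line(xs[:-1], ys[:-1])
print(f"slope with the point:    {slope_all:.2f}")
print(f"slope without the point: {slope_trimmed:.2f}")
```

A large change in slope from dropping a single observation suggests that point is influential. Whether that justifies removing it still depends on the conceptual questions above.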

Heteroscedasticity

Remember that heteroscedasticity is about variance. (It literally means “differing variance” – in Greek “hetero” means “different” and “skedasis” means “dispersion.”)
Any reasoning about heteroscedasticity that strays from talking about variance directly is a handy tip, not a definition. For example:
- Fan shape (this actually refers to range, not variance)
- Looking for unevenness (this can be influenced by the number of observations, not just variance)
- “Systematic pattern” in the residuals (this is much too general, and could refer to non-linearity rather than heteroscedasticity)
What to look for:
- You can eyeball variance estimates across your dataset by looking at your residual plot.
  - Why can’t we just calculate variance in Y’ across the predictor(s)?
- Try to judge what the average residual size is across the residual plot (this is more like the SD than the variance, but whatever).
  - Note that to judge average residual size, you need to take into account how dense the data are (how many observations you have at which values).
- Use handy tips (e.g. fan shape), but don’t be seduced by them – they don’t work 100% of the time.
- Be careful about outliers.
  - If you have outliers, they will almost certainly exaggerate the variance at that point.
  - If you have uneven variance driven by a small number of observations, either treat them as outliers (i.e. remove them) and say you don’t have heteroscedasticity, or keep them in the dataset and correct for the uneven variance (e.g. with weighted least squares, WLS).
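To back up the eyeballing described above, you can split the residuals into bins along the predictor and compare the spread in each bin. This is only a sketch: the data are simulated so that the spread grows with x, and `sd_by_bin` is a hypothetical helper. It uses equal-count bins so that every SD is based on the same number of observations, which sidesteps the data-density issue noted above.

```python
import random
import statistics

def sd_by_bin(xs, residuals, n_bins=4):
    """Sort observations by x, split into equal-count bins, return the residual SD per bin."""
    pairs = sorted(zip(xs, residuals))
    size = len(pairs) // n_bins
    sds = []
    for b in range(n_bins):
        # The last bin absorbs any leftover observations.
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = [r for _, r in pairs[b * size:end]]
        sds.append(statistics.stdev(chunk))
    return sds

# Simulated residuals whose SD is proportional to x (an assumption for illustration).
rng = random.Random(0)
xs = [rng.uniform(1, 10) for _ in range(400)]
residuals = [rng.gauss(0, 0.5 * x) for x in xs]

for i, sd in enumerate(sd_by_bin(xs, residuals), start=1):
    print(f"bin {i}: residual SD = {sd:.2f}")
```

If the per-bin SDs are roughly equal, the variance is probably even. If they climb or fall steadily across bins, that points to heteroscedasticity. A single bin that stands out may just be an outlier or two.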
The rationale for WLS:
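- The presence of heteroscedasticity does NOT bias your OLS regression line, but it does make the estimates less precise, so in any given sample the line MIGHT be further off than it needs to be. It also makes the usual standard errors unreliable. To get regression coefficients you can feel more confident about, run WLS instead.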
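As a rough illustration of what WLS does, the sketch below fits a one-predictor regression in which each observation is weighted by the inverse of its assumed error variance. The simulated data, the `fit_wls` helper, and the choice of weights (1/x², which assumes the error SD is proportional to x) are all assumptions made for this example. With real data the weights have to be justified or estimated.

```python
import random

def fit_wls(xs, ys, weights):
    """Weighted least squares with one predictor; returns (intercept, slope)."""
    total_w = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total_w
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Simulated data: true line y = 1 + 2x, with noise SD proportional to x.
rng = random.Random(1)
xs = [rng.uniform(1, 10) for _ in range(200)]
ys = [1 + 2 * x + rng.gauss(0, 0.5 * x) for x in xs]

ols = fit_wls(xs, ys, [1.0] * len(xs))           # equal weights reduce WLS to OLS
wls = fit_wls(xs, ys, [1 / x ** 2 for x in xs])  # weight = 1 / assumed error variance
print(f"OLS: intercept = {ols[0]:.2f}, slope = {ols[1]:.2f}")
print(f"WLS: intercept = {wls[0]:.2f}, slope = {wls[1]:.2f}")
```

Both fits aim at the same underlying line. The difference is that WLS leans less on the noisy high-variance observations, so its estimates tend to vary less from sample to sample.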

Systematic deviations from the regression line (non-randomness)

Remember that in the GLM, we assume that our errors (which the residuals estimate) are independent. There should be no systematic variation in your residual plot.
If you observe a trend in your residuals, that suggests that your current model is not a good one for these data. Adjust the model (transforming predictors, or adding predictors) and try again.
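Here is a small sketch of that adjust-and-retry loop, using made-up data with a curved relationship. The helper names `fit_line` and `line_residuals` are hypothetical. A straight-line fit on x leaves a U-shaped trend in the residuals, and refitting on the transformed predictor x² removes it.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def line_residuals(xs, ys):
    """Residuals (observed minus predicted) from a straight-line fit of ys on xs."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical curved data: y = 1 + 0.5 * x^2, with no noise.
xs = list(range(11))
ys = [1 + 0.5 * x ** 2 for x in xs]

raw = line_residuals(xs, ys)                           # predictor: x
transformed = line_residuals([x ** 2 for x in xs], ys)  # predictor: x squared

print("residuals, predictor x:  ", [round(r, 1) for r in raw])
print("residuals, predictor x^2:", [round(r, 1) for r in transformed])
```

The first set of residuals is positive at both ends and negative in the middle, which is the kind of systematic trend to look for. Real data will be noisier, so expect the trend to shrink after a good transformation rather than vanish completely.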
Be careful about outliers.
- If you have an apparent trend that is driven by a small number of observations (e.g. a handful of points that appear to trail off in a curve, indicating non-linearity), either treat them as outliers (i.e. remove them) and say there are no systematic deviations from the regression line, or keep them in the dataset and correct the model to account for that systematicity.
- It’s also possible to see a systematic trend in the residuals because outliers are pulling your regression line away from fitting the bulk of the data. If you remove those outliers, that should correct the problem.
