Residual Plots

FAQ: How should I interpret a residual plot?

What are you looking for?

  1. Extreme points (outliers)
  2. Uneven variance (heteroscedasticity)
  3. Systematic trends, anything non-random

Outliers

  • Remember that if you have random, independent, normally-distributed residuals (the ideal), you EXPECT some relatively extreme observations.
    • With N=100, roughly how many observations would you expect to fall outside of 2SD from the mean?
    • 2SD is a handy tip, not a definition. It’s a tool, to give you one option for a place to start your outlier analysis.
  • Before you remove outliers, ask yourself:
    • Are these points extreme relative to the rest of the observations?
    • Are these points influencing my model, such that the regression line doesn’t fit the bulk of the data as well as it could?
    • Are these points just extreme observations of the same basic effect, or do they seem to represent a different underlying process altogether? In other words, do these observations belong conceptually to the process the model is meant to describe?
  • The rationale for removing outliers is one (or both) of the following:
    • These points are not being driven by the same effect(s) as the other points in the model; there’s clearly a totally different process going on here (e.g. early adversity generally predicts negative outcomes, but some children show resilience – there’s something special about those cases that makes the model not work the same way).
    • These points are so extreme and influential that they are pulling the regression line away from a good fit for the rest of the data; if I leave these points in here, I’m sacrificing prediction accuracy for the bulk of my observations.
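
To put a number on the "N=100, outside 2SD" question above, here is a minimal sketch. It assumes the residuals really are independent and normally distributed (the ideal case); the function name expected_outside is made up for this example.

```python
import math

def expected_outside(n, k):
    """Expected number of n independent normal draws more than k SD from the mean."""
    # P(|Z| > k) for a standard normal Z, computed with the error function
    p_outside = 1.0 - math.erf(k / math.sqrt(2.0))
    return n * p_outside

print(round(expected_outside(100, 2), 2))  # 4.55
print(round(expected_outside(100, 3), 2))  # 0.27
```

So with 100 well-behaved residuals you should expect about four or five points beyond 2SD. Seeing a few there is not, by itself, a reason to remove anything.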

Heteroscedasticity

  • Remember that heteroscedasticity is about variance. (It literally means “differing variance” – in Greek “hetero” means “different” and “skedasis” means “dispersion.”)
  • Any reasoning about heteroscedasticity that strays from talking about variance directly is a handy tip, not a definition. For example:
    • Fan shape (this actually refers to range, not variance)
    • Looking for unevenness (this can be influenced by the number of observations, not just variance)
    • “Systematic pattern” in the residuals (this is much too general, and could refer to non-linearity rather than heteroscedasticity)
  • What to look for:
    • You can eyeball variance estimates across your dataset by looking at your residual plot.
      • Why can’t we just calculate variance in Y’ across the predictor(s)?
    • Try to judge what the average residual size is across the residual plot (strictly, this tracks the SD rather than the variance, but that’s fine for eyeballing, since the two rise and fall together).
      • Note that to judge average residual size, you need to take into account how dense the data are (how many observations you have at which values).
    • Use handy tips (e.g. fan shape), but don’t be seduced by them – they don’t work 100% of the time.
    • Be careful about outliers.
      • If you have outliers, they will almost certainly exaggerate the variance at that point.
      • If you have uneven variance driven by a small number of observations, either treat them as outliers (i.e. remove them) and say you don’t have heteroscedasticity, or keep them in the dataset and correct for the uneven variance (e.g. with weighted least squares, WLS).
  • The rationale for WLS:
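    • The presence of heteroscedasticity does NOT necessarily mess up your OLS regression line, but it MIGHT: the coefficients are still unbiased, but they are estimated less precisely and the usual standard errors can be off. To get regression coefficients you can feel more confident about, run WLS instead.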
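
The eyeballing advice above can be backed up with a rough numeric check: compute the SD of the residuals within bins of the predictor, then compare an OLS fit with a WLS fit. This is only a sketch under stated assumptions. The data are simulated, the helper names (fit_line, binned_residual_sd) are made up, and the WLS weights use the fact that we built the error variance to be proportional to x squared. With real data you would have to estimate the weights.

```python
import random
import statistics

def fit_line(x, y, w=None):
    """Weighted least squares for y = a + b*x; w=None means equal weights (plain OLS)."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b

def binned_residual_sd(x, resid, edges):
    """SD of the residuals within each [lo, hi) bin of the predictor."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [r for xi, r in zip(x, resid) if lo <= xi < hi]
        out.append(statistics.stdev(in_bin))
    return out

random.seed(1)
# Simulated data (an assumption for illustration): the error SD grows with x
x = [random.uniform(1, 10) for _ in range(600)]
y = [2.0 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]

# OLS fit, then residual SD in three bins of x
a, b = fit_line(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sds = binned_residual_sd(x, resid, [1, 4, 7, 10.0001])
print("residual SD by bin:", [round(s, 2) for s in sds])

# WLS fit: weight each observation by 1 / (its error variance, up to a constant)
a_w, b_w = fit_line(x, y, w=[1.0 / xi ** 2 for xi in x])
print("OLS slope:", round(b, 3), "WLS slope:", round(b_w, 3))
```

Binning sidesteps the density problem mentioned above: each bin's SD is computed from the observations actually in it, so a sparse region does not look "narrow" just because it has few points. It still helps to check how many observations each bin contains.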

Systematic deviations from the regression line (non-randomness)

  • Remember that in the GLM, we assume that our errors (which the residuals estimate) are independent. There should be no systematic variation in your residual plot.
  • If you observe a trend in your residuals, that suggests that your current model is not a good one for these data. Adjust the model (transforming predictors, or adding predictors) and try again.
  • Be careful about outliers.
    • If you have an apparent trend that is driven by a small number of observations (e.g. a handful of points that appear to trail off in a curve, indicating non-linearity), either treat them as outliers (i.e. remove them) and say there are no systematic deviations from the regression line, or keep them in the dataset and correct the model to account for that systematicity.
    • It’s also possible to see a systematic trend in the residuals because outliers are pulling your regression line away from fitting the bulk of the data. If you remove those outliers, that should correct the problem.
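
As an illustration of a trend in the residuals and of fixing it by transforming a predictor, here is a small sketch. The data are simulated with a curved (squared) relationship, and the helper names are made up for this example. It fits a straight line, averages the residuals in the low, middle and high third of the predictor, and then repeats the fit with the predictor squared.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def mean_residual_by_third(x, y):
    """Fit a line, then average the residuals in the low, middle and high third of x."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    pairs = sorted(zip(x, resid))  # order the residuals by predictor value
    k = len(pairs) // 3
    thirds = [pairs[:k], pairs[k:2 * k], pairs[2 * k:]]
    return [sum(r for _, r in t) / len(t) for t in thirds]

random.seed(2)
# Simulated data (an assumption for illustration): y depends on x squared
x = [random.uniform(0, 10) for _ in range(300)]
y = [1.0 + 0.4 * xi ** 2 + random.gauss(0, 1.0) for xi in x]

linear = mean_residual_by_third(x, y)                          # predictor: x
squared = mean_residual_by_third([xi ** 2 for xi in x], y)     # predictor: x squared

print("straight-line fit:", [round(m, 2) for m in linear])
print("after transforming x:", [round(m, 2) for m in squared])
```

With the straight-line fit the mean residual is positive at the ends and negative in the middle (a U shape), which is the kind of systematic trend to look for. After the transformation the three means all sit close to zero.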

 

What is a residual?

A residual is the difference between the actual score and the predicted score for one observation (usually one participant). If you’re using ordinary least squares regression (which includes ANOVAs, t-tests, typical regression, etc. – everything you do in a typical applied stats class) and your model includes an intercept, the average of all of the residuals will be zero. That’s why, when we’re talking about things that apply to the whole dataset, we sometimes leave off the residual term altogether. For any individual observation, you can calculate the residual by getting the score your model would have predicted for someone in that situation (e.g. someone in that cell in your factorial ANOVA, or someone with that predictor score if you’re doing a regression) and then subtracting that from their actual measured score.
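
A minimal sketch of that calculation for a one-predictor regression follows. The five scores are made up, and the function name residuals is just for this example.

```python
def residuals(x, y):
    """Residual = actual score minus the score predicted by an OLS line with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up predictor and outcome scores for five participants
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(x, y)
print([round(r, 2) for r in res])
# The residuals average to zero (up to floating-point rounding)
print(abs(sum(res) / len(res)) < 1e-9)  # True
```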