Take-home tests

Do you actually care whether or not we cheat on the stats tests?

Yes, we care a lot. Please don’t cheat. That’s not the right way to start a career as a scientist, guys. Think about who you want to be (hint: the answer is not Dr. Van Datafraud).

Most students experience this sequence as time-consuming, stressful, and confusing, at least some of the time. This is made more difficult by the fact that you’re all grad students, which means pretty much everything else about your lives is also time-consuming, stressful, and confusing – grad stats is just one more straw on the camel’s back. It’s also very important (to us, and hopefully to you all as well) that you learn this material well and are able to take a good foundational knowledge of statistics with you as you launch into your own research and eventual career. Arguably the two most important characteristics of a scientist are the desire to pursue truth and the ability to detect it. Statistics is your number one tool for the latter. So the stakes are high and the task is daunting. What happens if you take a shortcut?

We strongly encourage you to collaborate on the homework assignments. Working together with your peers is an invaluable way to improve your own understanding of the material. The final product you turn in must be your own work, but we would love nothing better than for you to work through it with your classmates, check your answers, ask each other questions, etc.

You are not permitted, however, to collaborate on the take-home exams. We need to assess you as individuals at a couple points throughout each course to check the level of your understanding on your own. This can be a valuable experience for students as well, and often makes students realize which pieces they totally get and which they still haven’t quite grasped – distinctions that might not always be clear in a group-work setting. You may be tempted, however, to collaborate just a little bit on tests, especially if you’re experiencing the time-consuming, stressful, confusing nature of the sequence. Don’t.

But what if you might fail? We are friendly, reasonable people, and we really want you to do well. We work really hard to support students throughout the course, so hopefully you’re never faced with the prospect of failing a test, but even the best intentions don’t always yield the desired results. It’s not unheard of for a student to fail a test, or to fail a class altogether. In this case, we work with you, we try to support you, we figure out what your options are and discuss with you what the best plan will be going forward. We know this sequence is hard, and we know you’re under a lot of pressure. We want to help. If you cheat, you make it impossible for us to help you. More importantly, you earn the derision of your colleagues (for we are definitely your colleagues), and presumably of yourself as well. Cheating on your stats test is absolutely the wrong way to start your career as an ethical researcher. And if you’re worried about failing and maybe having to re-take the class, ask yourself whether you’d rather have a re-take on your transcript, or a re-take and an official notice of academic misconduct. Ask yourself which would be harder to explain to search committees when you’re applying for jobs after graduation.

There is no reason to cheat, and every reason not to. Don’t.

Violations of sphericity?

FAQ: When do we have to worry about a violation of sphericity?

Whenever you run a repeated measures design with more than 2 repeated measures (e.g. measuring a group of participants on the criterion three times each, at Time 1, Time 2, and Time 3), you need to worry about sphericity on all of your within-subjects effects.

So why don’t we need to worry about sphericity if we only have 2 repeated measures (e.g. measuring a group of participants on the criterion at just Time 1 and Time 2)? Remember what this assumption actually refers to: the population variances of the sets of difference scores should be equal. So for three time points, you’ll have three sets of difference scores: T1-T2, T1-T3, and T2-T3. When you only have two time points, there’s only one set of difference scores: T1-T2. It’s impossible to have a violation. The test we run to check this assumption is Mauchly’s W, which is actually technically a test of compound symmetry rather than sphericity (it works out well, though, since whenever we have compound symmetry, we can trust that sphericity is also okay). Mauchly’s compares the covariance matrix for the repeated measures to a smoothed out version of it, where all of the covariances are replaced by the average covariance, and all of the variances (down the diagonal) are replaced by the average variance. So basically, Mauchly’s tests whether all of the variances are equal, and whether all of the covariances are equal. Again, if there are only two repeated measures, it’s impossible to have a violation since there is only one covariance.
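If it helps to see the assumption concretely, here is a minimal sketch in R using a small simulated data set (the column names T1, T2, T3 are hypothetical). With three time points there are three sets of difference scores whose variances can disagree; with two time points there is only one set, so there is nothing to compare.

```r
# Minimal sketch: the sphericity assumption is about the variances of the
# pairwise difference scores. Hypothetical wide-format data frame 'dat'
# with one row per participant and columns T1, T2, T3.
set.seed(1)
dat <- data.frame(T1 = rnorm(30, 10), T2 = rnorm(30, 11), T3 = rnorm(30, 12))

# With three time points there are three sets of difference scores:
diffs <- data.frame(d12 = dat$T1 - dat$T2,
                    d13 = dat$T1 - dat$T3,
                    d23 = dat$T2 - dat$T3)
sapply(diffs, var)   # eyeball whether these variances are roughly equal

# With only two time points there is a single set of difference scores,
# so there is nothing to compare and no way to violate the assumption.
var(dat$T1 - dat$T2)
```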

Note that this logic applies to contrasts as well as to F tests on main effects. When you run contrasts on the levels of your repeated measure, you’re only comparing two groups at a time (it’s possible that one or both of those groups is made up of observations pooled from other groups, but when you’re testing them it’s been simplified to just two groups). Since there are only two groups, it’s impossible to have a violation of sphericity or compound symmetry, so SPSS doesn’t provide corrections for tests with 1 df (i.e. tests with only two groups), such as contrasts.

Fixed or Random?

Whether we consider a factor fixed or random is more about how the researcher thinks about the factor than about what the design actually looks like. Let’s say I run a study with 100 participants in 2 conditions. If I want to generalize my results to people other than the particular 100 people I ran in my study, then subjects should be a random factor. But if the 100 people in my study are 100% of the population I’m interested in (e.g. I want to know how members of this psych department behave under condition A vs. condition B, and I run all of us in the experiment), then subjects is a fixed factor. It still looks the same on the surface (100 people in two conditions), but the way we conceptualize the design is different. Similarly, if you test 4 emotions, that could be fixed or random, depending on how you want to generalize your results. If these four emotions are the only ones you want to know about, then they’re fixed. If these are a sample from a larger population of emotions, and you want to generalize to that population, then they’re random. The study looks the same on the surface in both cases (testing people under each of the four emotions included in your study), but the design and the EMS table change.

This has implications for replication attempts. If you wanted to replicate one of my example studies above, the way you would replicate it depends on whether the factors are fixed or random. If you want to replicate my finding about how members of this psychology department perform under condition A vs. condition B, then you should use the same exact people I did again (the members of this psychology department). But if you’re interested in generalizing the results to a larger population (academic psychologists in general, perhaps), then you could and should use a new sample from that population. Similarly, if you want to replicate my findings about how people behave under different emotions, whether you use the exact same 4 emotions I used depends on whether the study is actually about those 4 emotions and no others, or if it’s about emotions in general and I just happened to pick those 4 examples. If you want to generalize to a larger population of emotions, then you could and should draw a new sample of emotions to test.

All that matters is whether you want to generalize to some larger population or not.
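If you like to think in model syntax, here is one way the distinction can show up, sketched in R with the lme4 package and simulated data. The variable names are made up, and this is just an illustration – the same fixed-vs-random decision also plays out in the EMS table for a traditional ANOVA.

```r
library(lme4)

# Hypothetical long-format data: 100 subjects x 4 emotions, 2 conditions.
set.seed(1)
d <- expand.grid(subject = 1:100, emotion = c("joy", "fear", "anger", "sadness"))
d$condition <- ifelse(d$subject <= 50, "A", "B")
d$subject <- factor(d$subject)
d$score <- rnorm(nrow(d)) + rep(rnorm(100, sd = 0.5), times = 4)  # noise + subject effect

# Emotion as FIXED: these four emotions are the whole population of interest.
m_fixed  <- lmer(score ~ condition * emotion + (1 | subject), data = d)

# Emotion as RANDOM: these four emotions are a sample from a larger population
# of emotions we want to generalize to. (With only 4 levels the emotion variance
# is estimated very roughly; this is just to show the structure.)
m_random <- lmer(score ~ condition + (1 | subject) + (1 | emotion), data = d)
```

In the first model, the emotion estimates describe only those four emotions; in the second, emotion contributes a variance component, which is what licenses generalizing beyond the particular emotions you sampled.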

Regression module vs. GLM module in SPSS

The two SPSS routes compared below are Regression (ANALYZE → REGRESSION) and GLM (ANALYZE → GENERAL LINEAR MODEL → UNIVARIATE ANOVA).

  • Can you run an ANCOVA?
    • Regression: Yes
    • GLM: Yes
  • Can you test the homogeneity of regression coefficients assumption?
    • Regression: Yes (use hierarchical regression)
    • GLM: Not easily
  • Can you enter a categorical variable without dummy coding it first?
    • Regression: NOOOO!!!! Every time you put a categorical variable that hasn’t been dummy coded into Regression, a kitten dies.
    • GLM: Yes. SPSS dummy codes it for you behind the scenes.
  • Can you get an overall test for the effect of group when there are more than 2 levels (rather than just contrasts comparing particular groups)?
    • Regression: Yes, but it’s a pain (you have to use hierarchical regression, entering all of the dummy codes for your categorical predictor as a step in the model).
    • GLM: Yes, it automatically gives you that F-test in the Between-Subjects Effects table.
  • Can you get estimates of effect size (partial eta-squared)?
    • Regression: Not easily
    • GLM: Yes, it’s under Options
  • Can you get collinearity diagnostics (e.g. tolerance)?
    • Regression: Yes, it’s under Statistics
    • GLM: Not easily
  • Can you get a plot of the residuals?
    • Regression: Yes, under Plots, put ZPRED on the X-axis and ZRESID on the Y-axis
    • GLM: Yes, if you save the standardized predicted values and standardized residuals as new variables, and then use the scatterplot function under Graphs
  • Can you see the adjusted means for each group?
    • Regression: No (you can calculate them by hand, though)
    • GLM: Yes
  • Can you get handy plots of the adjusted means?
    • Regression: Not easily
    • GLM: Yes, specify the plots you want under Plots
  • Can you test for differences between the adjusted means?
    • Regression: Yes, any contrasts you have built into your design are testing differences between adjusted means.
    • GLM: Yes, specify the comparisons you want under Contrasts
  • When I report contrasts, where is the t-statistic?
    • Regression: Use the t-test for that comparison from the coefficients table.
    • GLM: From the contrasts table, t = contrast estimate / SE.
  • What is the df for the t-test of a contrast?
    • Regression: Use the residual df from the Model Summary.
    • GLM: Use the error df from the Tests of Between-Subjects Effects.

When we do ANCOVA examples in lab, we often start by running a hierarchical regression in the Regression module, and then we switch to the GLM module. Why? The results SPSS comes up with are the same either way since the underlying math is identical, but you can get different pieces of output from the two different modules, and some tasks are easier in one module vs. the other. For example, it’s easiest to test the assumption of homogeneity of regression coefficients under Regression, but it’s easiest to get plots of the adjusted means from GLM. It’s not necessary to run it in both, but you may want to because then you can get the best of both worlds output-wise.
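For anyone working outside SPSS, the same two-step logic can be sketched in R with simulated data (hypothetical variables y, cov, and group; this illustrates the logic rather than prescribing a procedure).

```r
# Sketch of the ANCOVA logic: outcome y, a 3-level factor group, and a
# continuous covariate cov (all simulated).
set.seed(2)
d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 30)),
                cov   = rnorm(90))
d$y <- 0.5 * d$cov + rep(c(0, 0.3, 0.6), each = 30) + rnorm(90)

# Step 1 (the "hierarchical regression" step): does adding the group x cov
# interaction improve the model? A significant interaction would mean the
# homogeneity-of-regression-coefficients assumption is violated.
m_main <- lm(y ~ cov + group, data = d)
m_int  <- lm(y ~ cov * group, data = d)
anova(m_main, m_int)

# Step 2 (the "GLM module" step): the ANCOVA itself. Because group is entered
# after the covariate, its F test is adjusted for cov.
anova(m_main)

# Adjusted means: predicted group means with the covariate held at its mean.
predict(m_main, newdata = data.frame(group = levels(d$group), cov = mean(d$cov)))
```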

Residual Plots

FAQ: How should I interpret a residual plot?

What are you looking for?

  1. Extreme points (outliers)
  2. Uneven variance (heteroscedasticity)
  3. Systematic trends, anything non-random

Outliers

  • Remember that if you have random, independent, normally-distributed residuals (the ideal), you EXPECT some relatively extreme observations.
    • With N=100, roughly how many observations would you expect to fall outside of 2SD from the mean?
    • 2SD is a handy tip, not a definition. It’s a tool, to give you one option for a place to start your outlier analysis.
  • When you remove outliers ask yourself:
    • Are these points extreme relative to the rest of the observations?
    • Are these points influencing my model, such that the regression line doesn’t fit the bulk of the data as well as it could?
    • Are these points just extreme observations of the same basic effect, or do they seem to represent a different underlying process altogether? In effect, can these observations be considered to inform the model conceptually?
  • The rationale for removing outliers is one (or both) of the following:
    • These points are not being driven by the same effect(s) as the other points in the model; there’s clearly a totally different process going on here (e.g. early adversity generally predicts negative outcomes, but some children show resilience – there’s something special about those cases that makes the model not work the same way).
    • These points are so extreme and influential that they are pulling the regression line away from a good fit for the rest of the data; if I leave these points in here, I’m sacrificing prediction accuracy for the bulk of my observations.
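Here is a rough sketch in R of how you might flag candidates on both criteria – extremeness and influence – using simulated data. The 2 SD and 4/n cutoffs below are common starting points, not rules.

```r
# Flagging candidate outliers and influential points for a fitted regression
# (hypothetical simulated data).
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

z_res <- rstandard(fit)          # standardized residuals
which(abs(z_res) > 2)            # extreme relative to the rest? (expect a few
                                 # even when everything is behaving)

cooks <- cooks.distance(fit)     # is the point actually pulling on the fit?
which(cooks > 4 / length(cooks)) # one common rough flagging heuristic
```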

Heteroscedasticity

  • Remember that heteroscedasticity is about variance. (It literally means “differing variance” – in Greek “hetero” means “different” and “skedasis” means “dispersion.”)
  • Any reasoning about heteroscedasticity that strays from talking about variance directly is a handy tip, not a definition. For example:
    • Fan shape (this actually refers to range, not variance)
    • Looking for unevenness (this can be influenced by the number of observations, not just variance)
    • “Systematic pattern” in the residuals (this is much too general, and could refer to non-linearity rather than heteroscedasticity)
  • What to look for:
    • You can eyeball variance estimates across your dataset by looking at your residual plot.
      • Why can’t we just calculate variance in Y’ across the predictor(s)?
    • Try to judge what the average residual size is across the residual plot (this is more like the SD than the variance, but whatever).
      • Note that to judge average residual size, you need to take into account how dense the data are (how many observations you have at which values).
    • Use handy tips (e.g. fan shape), but don’t be seduced by them – they don’t work 100% of the time.
    • Be careful about outliers.
      • If you have outliers, they will almost certainly exaggerate the variance at that point.
      • If you have uneven variance driven by a small number of observations, either treat them as outliers (i.e. remove them) and say you don’t have heteroscedasticity, or keep them in the dataset and correct for the uneven variance (i.e. WLS).
  • The rationale for WLS:
    • The presence of heteroscedasticity does NOT necessarily mess up your OLS regression line, but it MIGHT. To get regression coefficients you can feel more confident about, run WLS instead.
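As a concrete (hypothetical) illustration of that rationale, here is one common two-step way to run WLS in R on simulated heteroscedastic data: estimate how the residual spread changes across the predictor, then weight each observation by the inverse of its estimated error variance. This is a sketch of one approach, not necessarily the exact procedure from class.

```r
# Simulate heteroscedastic data: error SD grows with x.
set.seed(4)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)

ols <- lm(y ~ x)

# Model the residual spread as a function of x, then weight by 1/variance.
spread <- lm(abs(resid(ols)) ~ x)
w <- 1 / fitted(spread)^2
wls <- lm(y ~ x, weights = w)

summary(ols)$coefficients
summary(wls)$coefficients   # similar slope, but SEs you can feel better about
                            # when the variance is uneven
```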

Systematic deviations from the regression line (non-randomness)

  • Remember that in the GLM, we assume that our errors (i.e. the residuals) are independent. There should be no systematic variation in your residual plot.
  • If you observe a trend in your residuals, that suggests that your current model is not a good one for these data. Adjust the model (transforming predictors, or adding predictors – see the sketch after this list) and try again.
  • Be careful about outliers.
    • If you have an apparent trend that is driven by a small number of observations (e.g. a handful of points that appear to trail off in a curve, indicating non-linearity), either treat them as outliers (i.e. remove them) and say there are no systematic deviations from the regression line, or keep them in the dataset and correct the model to account for that systematicity.
    • It’s also possible to see a systematic trend in the residuals because outliers are pulling your regression line away from fitting the bulk of the data. If you remove those outliers, that should correct the problem.
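A quick simulated example of the “adjust the model and try again” advice: hypothetical data with a quadratic relationship, first fit with a straight line.

```r
# A curved trend in the residuals often means the model is missing a term.
set.seed(5)
x <- runif(150, -3, 3)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(150)

m1 <- lm(y ~ x)
plot(fitted(m1), rstandard(m1))   # residuals show a clear U-shaped trend

m2 <- lm(y ~ x + I(x^2))          # add the missing term and re-check
plot(fitted(m2), rstandard(m2))   # trend gone; residuals look random
```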

 

Tolerance

If a predictor has high tolerance, that means there’s a strong unique effect of that predictor, right?

No. The test of the regression coefficient is always the test of the predictor’s unique contribution, whether or not there’s multicollinearity. If you have a significant effect for a predictor with low tolerance, that test still indicates what it always does – that the predictor’s unique relationship (taking into account everything else in the model) with the criterion is significantly different from zero. A high tolerance suggests that there is minimal overlap between predictors, so the test of that predictor’s unique effect (the regression coeff) is similar to a test of the bivariate relationship between that predictor and the criterion. That makes the estimate of the effect more stable (changing the model doesn’t change the effect as much, since it’s reflecting the bivariate relationship between the predictor and criterion, which doesn’t depend on the model), and decreases the SE. So having a high tolerance makes it easier to test the unique effect of a predictor, since it’s not being masked by overlap with other variables, but a high tolerance doesn’t actually indicate anything about the strength (or significance) of the relationship between a predictor and the criterion.
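If you want to see where tolerance comes from, here is a small simulated sketch in R: tolerance for a predictor is just 1 minus the R² from regressing that predictor on the other predictors (equivalently, 1/VIF).

```r
# Hypothetical predictors: x2 overlaps substantially with x1, x3 does not.
set.seed(6)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)
y  <- x1 + x2 + x3 + rnorm(100)

# Tolerance for x1: how much of x1 is NOT shared with the other predictors.
tol_x1 <- 1 - summary(lm(x1 ~ x2 + x3))$r.squared
tol_x1   # low tolerance = lots of overlap = less stable estimate and a bigger SE,
         # but it says nothing about how strongly x1 relates to y
```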

How should I interpret significant results for…

…an F test? Are you comparing just 2 means? If so, interpret it just like a t-test: You must interpret the direction of the effect (glance at the group means to see which is higher if you don’t already know). If you have more than 2 means, then the F test tells you there are one or more significant differences somewhere among the means, but it does NOT tell you where the significant difference(s) are. If this is a one-way ANOVA, you’ll want to run contrasts to find out. If this is a factorial ANOVA, see the information below about main effects, interactions, simple effects, and contrasts.

…a main effect (in factorial ANOVA)? This is just another F test, but it compares row means or column means (which average across individual groups) rather than comparing group means (i.e. cell means) individually. Are you comparing only 2 row (or column) means? If so, interpret the direction of the effect. If you are comparing more than 2 row/column means, then the significant main effect tells you there are one or more significant differences somewhere among the row or column means you tested, but does NOT tell you where the significant difference(s) are.

  • Note that since main effects only test combined means (row or column averages), you can’t always say for sure what’s happening with the cell means based on main effects. For example, in a 2×3 ANOVA examining gender (M or F) and treatment (A, B, or C), if you see a significant main effect of gender such that men score higher than women, that only means that men’s scores averaged across all three treatments are higher than women’s scores averaged across all three treatments. It’s possible that one of the treatments worked quite differently than the other two (maybe women score higher than men in treatment A) – that’s an interaction. Saying simply “there was a significant main effect of gender such that men scored higher than women” is misleading because it’s not true in all cases. If you have a significant main effect and also a significant interaction, you need to interpret the interaction to decide whether or not the main effect is still meaningful.

…an interaction (in factorial ANOVA)? This is also just another F test, testing whether there are any significant differences among the cell means after factoring out row and column effects. Since factorial ANOVAs will always have more than 2 cell means (the simplest factorial ANOVA is 2×2, so you’ll always have at least 4 cell means), you can never interpret the direction of the effect for an interaction without running more tests. The “more tests” you run are typically simple effects tests, and contrasts (if appropriate).

…a simple effects test (in factorial ANOVA)? This is also just an F test. It’s actually a one-way ANOVA, comparing all the cell means in a particular row or column. So, if there are only 2 cell means in that row (or column), then you interpret the direction of your effect and there’s no more work to do (i.e. you won’t run contrasts on those two means – you already know whether they’re different, and if so which is higher. There’s nothing else to learn). If there are more than 2 cell means in that row/column, then the significant simple effects test tells you there are one or more significant differences somewhere among the means, but does NOT tell you where the significant difference(s) are. You’ll want to run contrasts to find out.

… contrasts? A contrast always compares only 2 means, so you always can (and must) interpret the direction of a significant contrast. Sometimes one or more of the means is a mean averaged across multiple groups, which is fine – for example, in a set of Helmert contrasts on treatments A, B, and C, one contrast would compare the average of A and B to C, and the second contrast would compare A and B. If both contrasts were significant, you would interpret the direction for the first by looking at the mean of groups A and B pooled together and the mean of C to see which is higher. You interpret the direction of the second contrast just by looking at the mean of A and the mean of B.

  • Note that in factorial ANOVAs, there are two common situations where you would find yourself wanting to run a contrast: To understand a significant main effect on more than 2 row or column means, or to understand a significant simple effects test on more than 2 cell means.

EXAMPLE TIME! You do a 2×3 ANOVA testing the effect of gender (M or F) and treatment (A, B, or C). Let’s say treatments A, B, and C refer to dosage levels for a new drug (A = low, B = medium, C = high dose).

1.    The factorial ANOVA is significant. You know the cell means are not all the same, but you don’t know how they differ.

2.    You have a significant main effect of gender. Since there are only two levels of gender (M or F), you can interpret the direction of the effect. You examine the mean for men averaged across all three treatments and see that it is higher than the mean for women averaged across all three treatments. You know all you can learn about the mean for men vs. women averaged across all three treatments: they’re significantly different, and men score higher.

3.    You have a significant main effect of treatment. You know the means for each treatment (A, B, and C) averaged across both genders are not all the same, but you don’t know how they differ.

4.    You have a significant interaction between gender and treatment. You know that the cell means for each gender-treatment combination (after accounting for row and column effects) are not all the same, but you don’t know how they differ. Importantly, the fact that you know there are still significant differences between some cell means beyond that explained by your main effects of gender and treatment (the row and column effects) makes you hesitant to assume your main effects apply across the board. You need to find out how this interaction works to see whether your main effects still make sense or not.

  • Note that at this point there are still several significant results we can’t fully interpret: the significant main effect of treatment (e.g. we still don’t know which treatment(s) worked the best!), and the interaction between gender and treatment (we know that the effect of treatment depends on gender, but we don’t know how it works). This lingering ambiguity motivates the next tests we run.

5.    First, let’s tackle that main effect of treatment. You want to know how the three column means for treatment (A averaged across both genders, B averaged across both genders, and C averaged across both genders) differ. Since the levels of treatment are meaningfully ordered (low, med, high), polynomial trend contrasts make sense. You get a significant linear contrast, and a non-significant quadratic contrast. You know that low (averaging across both genders) is different from high (averaging across both genders), and since this isn’t qualified by a quadratic trend you know that medium (averaging across both genders) is not significantly different from the average of low and high, suggesting that scores (averaging across both genders) increase as dosage increases, and that scores go up about the same amount for each increase in dose.

Now let’s work on the interaction…

6.    You have a significant simple effect of treatment at men. This is a one-way ANOVA comparing the means for men who got low dose (A), men who got medium dose (B), and men who got high dose (C). You know that there are one or more significant differences somewhere among the means, but this does NOT tell you where the significant difference(s) are.

7.    You have a non-significant simple effect of treatment at women. This is a one-way ANOVA comparing the means for women who got low dose (A), women who got medium dose (B), and women who got high dose (C). You know that there are no significant differences among these means. That suggests that dosage level doesn’t affect women’s scores on this task (i.e. no matter what dosage they got, all of the groups scored about the same). Note that I’m accepting the null hypothesis here, which is sloppy – it’s also quite possible that there are real differences for women who get different dosage levels, but we don’t have a big enough sample here to detect the effect.

8.    You have a significant simple effect of gender at treatment A. Since there are only two levels of gender, there are only two cell means in the low-dose group: men and women. You examine the mean for men in treatment A and see that it is higher than the mean for women in treatment A. You know all you can learn about the mean for men vs. women in treatment A: they’re significantly different, and men score higher.

9.    You have a significant simple effect of gender at treatment B. You examine the mean for men in treatment B and see that it is higher than the mean for women in treatment B.

10. You have a significant simple effect of gender at treatment C. You examine the mean for men in treatment C and see that it is higher than the mean for women in treatment C.

Taking these three simple effect tests together (the simple effects of gender at each level of treatment), we can see that no matter which dosage group you examine, women score significantly lower than men. So now we know that our main effect of gender holds across all of the treatments.

  • Note that at this point we can interpret our main effects, but we’re not done with the interaction. We understand the direction of both of our main effects and we know that the main effect of gender is true across all the treatments. The interaction is messing with our main effect of treatment, though: we know treatment works differently in men and women because there is a significant effect of treatment within the men, but there isn’t a significant effect of treatment in women. That means the contrast we ran showing there’s a significant linear trend in treatment collapsing across genders (part 5) is no good. We’re not done, though, because we still can’t fully interpret the interaction – we still don’t know how treatment works in the men, just that there are differences between the treatments A, B and C in men. Use contrasts to find out.

11. You run polynomial trend contrasts on treatment within men. You get a significant linear contrast, and a non-significant quadratic contrast. You know that, for men, low dose is different from high dose, and since this isn’t qualified by a quadratic trend you know that medium dose is not significantly different from the average of low and high, suggesting that scores for men increase as dosage increases, and that scores go up about the same amount for each increase in dose.

  • Okay. Now we’ve ironed out all the details. Here’s what we know: There is a significant effect of gender such that men score higher than women, and a significant effect of treatment which is qualified by an interaction between gender and treatment. There is not a simple effect of treatment within women, but there is within men such that scores increase linearly as dosage increases. Basically, it appears that men react to dosage such that the higher the dose they get, the higher their scores, whereas women always score lower than men and appear not to be affected by dosage. Ta-da!
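If you’d like to see the whole sequence in one place, here is a sketch in R using simulated data built to mimic the pattern above (men increase linearly with dose, women flat and lower). The variable names are made up, and note that fitting each simple effect as its own one-way ANOVA re-estimates the error term within that subset, which is one of several ways simple effects can be computed.

```r
# Simulated 2x3 design: gender (M, F) x dose (low, med, high), 20 per cell.
set.seed(7)
d <- expand.grid(gender = c("M", "F"), dose = c("low", "med", "high"), rep = 1:20)
d$dose  <- factor(d$dose, levels = c("low", "med", "high"), ordered = TRUE)
d$score <- with(d, ifelse(gender == "M", 5 + 2 * as.numeric(dose), 4)) + rnorm(nrow(d))

# Omnibus factorial ANOVA: main effects and the interaction.
summary(aov(score ~ gender * dose, data = d))

# Simple effect of dose within each gender (a one-way ANOVA in each row).
summary(aov(score ~ dose, data = subset(d, gender == "M")))
summary(aov(score ~ dose, data = subset(d, gender == "F")))

# Polynomial trend contrasts on dose within men: because dose is an ordered
# factor, R uses polynomial contrasts (dose.L = linear, dose.Q = quadratic).
m_men <- aov(score ~ dose, data = subset(d, gender == "M"))
summary.lm(m_men)   # look at the t tests for dose.L and dose.Q
```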

Low Power

What does it mean to have low power?

A couple important things to keep in mind:

  • Power is not the same as effect size. Effect size is relevant to a particular data set or population(s) – you collect the data, then you analyze that specific data set to see how big the effect is in that sample. You can hypothesize about the “true” population effect size based on your sample results, or based on theory or prior work. Power is relevant to a particular experimental design – the specifics of your experiment, but NOT the particular dataset.
  • Power is a probability, just like β and α. In null hypothesis testing, there are two possible realities for any given test – either the null hypothesis really is true, or it is false. Because we can’t measure anything with perfect accuracy, though, there’s an element of randomness in our testing of that reality, which means that if the null hypothesis is true, there’s a chance we’ll correctly retain it, but there’s also a chance we’ll incorrectly reject it just because we happen to get a weird sample. Similarly, if the null hypothesis is actually wrong, there’s a chance we’ll reject it, but there’s also a chance we’ll incorrectly retain it. β is the probability of retaining the null hypothesis when you should reject it, i.e. getting p > .05 (not significant) when there really is an effect in the population. α is the probability of incorrectly rejecting the null when you should retain it, i.e. getting p < .05 (significant) when there really is NOT an effect in the population. Power is the probability of rejecting the null when that’s the correct thing to do, i.e. getting p < .05 (significant) when there really is an effect in the population. Power and β describe the same reality (the null is false), so power = 1 − β.
  • What influences the probability of rejecting the null, assuming that it really is wrong (power)? Effect size, sample size, and whatever you set α to be (more stringent α means lower power). Make sure you understand WHY each of these things affects power. Review Lab 6 for help.

When you calculate power, you get a number between 0 and 1 because it’s a probability: 0 means it’s impossible, 1 means it’s definitely going to happen. Remember that this number is based on the assumption that the effect is real (you enter a hypothesized effect size when you calculate power – if the null hypothesis were true, that effect size would need to be entered as zero). So let’s say you want to run a study with 20 participants, and you anticipate the effect size will be .5, so you calculate that you have a power of .36. That means that if you run this study with N=20 an infinite number of times and the true effect size in the population is really .5, then 36% of the time you’ll get a significant result. But real researchers don’t run the same study over and over an infinite number of times, they run it once. The practical implication of low power, as in the example here, is that if you run your study as proposed, it’s pretty unlikely you’ll get a significant result, even if the effect is real. As a rule of thumb, you shouldn’t run a study unless you have a power of at least 80%. Otherwise it’s not really worth your time.
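If you want to compute power yourself, base R’s power.t.test is one option. This sketch assumes a two-sample t-test with n per group; exactly what “20 participants” and “effect size .5” map onto depends on your design, so don’t expect these numbers to reproduce the .36 in the example.

```r
# Power of a two-sample t-test with 20 participants per group and a
# standardized effect of 0.5 (delta = 0.5, sd = 1).
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)

# Or flip it around: how many participants per group for 80% power?
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)
```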

 

If you know a test was under-powered, how does that change how you interpret a significant result?

So what if you know your power is low, but you run it and you get a significant result anyway? Or what if you don’t conduct power analyses until after you’ve already run the study and analyzed the data? If you had low power and got a significant result, one of two things happened:

  • This result reflects reality (i.e. the effect is real in the population), and you just got lucky with your random samples. Even though there was only a 36% chance of randomly getting samples that would result in a significant effect, you happened to fall into that 36%. The effect size you measure in your dataset will almost certainly be an over-estimate of the true effect size in the population.
  • This is a Type 1 error. Usually, we can assume that the probability of getting a Type 1 error is quite low (5%, or 1 out of 20 tests run). Unfortunately, there are lots of things that can change your Type 1 error rate without you meaning to, and many of them are easy to do accidentally. Type 1 error rate can climb over 50% for some types of problems!

When you get a significant result for a low-power study, you should feel a little uneasy. Basically, you’re faced with accepting that one of two low-probability events occurred, or that there’s something going on that’s messing with your results (such as something inflating your Type 1 error rate). When low-powered studies successfully replicate, that’s especially concerning. Let’s say you ran two follow-up studies with similar designs, and replicated your original finding in each of them (or let’s say you’re reading an article where the author claims to have done this). The probability of getting three significant results from three studies with this design is .36 × .36 × .36 ≈ .047, or about 4.7% (compared to getting three Type 1 errors in a row when α is .05, which would be about .01%). Are you willing to believe that happened? Probably not. What’s much more likely is that something is happening that’s inflating your Type 1 error rate, so it’s not really .05, but actually much higher. One very common thing that can do this to you is multiple comparisons (when you run lots of tests but then just report the significant ones, acting as though they were theoretically motivated). Another common mistake that drastically inflates Type 1 error rate is treating random effects as fixed – we’ll talk about this in 612. And, of course, some bad scientists just doctor their data. That leads to a pretty high Type 1 error rate, unsurprisingly.

The moral of the story is that if you see significant results from a low-power study design, you should evaluate that result very critically, whether it’s your own work or you’re reading someone else’s report of their study. There’s always the risk that Type 1 error rate was actually a lot higher than the researcher reported (.05), and that’s why they reached significance.

What’s the difference between all the different kinds of orthogonal contrasts (Helmert, polynomial trend, etc.)?

There is no difference, really! For J group means, you can create J-1 orthogonal contrasts, but the particular contrasts that would be theoretically motivated will differ study to study. For some reason (vanity?) people started naming a couple common sets of orthogonal contrasts. There’s nothing fundamentally different about conducting Helmert contrasts vs. polynomial trend contrasts vs. some other set of orthogonal contrasts you invent. They’re all just contrast weights applied to group means. When you have a set of means you want to conduct contrasts on, just think about which comparisons would make sense theoretically and figure out a way you can elegantly get that information. When possible, you should construct orthogonal contrasts (but don’t stop yourself from testing an important question if it would mean non-orthogonal contrasts – orthogonality is good, but not vital). Maybe you’ll end up with a set of contrasts that has been named by somebody, maybe you won’t. It completely doesn’t matter. Just pay attention to your contrast weights and you’ll be able to interpret your results just fine.
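To see that these “named” sets really are just weight matrices, here is a quick sketch in R. Note that R’s contr.helmert compares each level to the mean of the preceding levels (what some texts call reverse Helmert), and, as noted below, how these weights show up in model output has some subtleties.

```r
# Named contrast sets are just particular weight matrices for a 3-level factor.
contr.poly(3)      # polynomial trend weights (linear and quadratic)
contr.helmert(3)   # R's Helmert coding (each level vs. mean of preceding levels)

# Or write your own weights for the comparisons you actually care about,
# e.g. (A + B)/2 vs. C, and A vs. B:
my_contrasts <- cbind(c(0.5, 0.5, -1),   # average of A and B vs. C
                      c(1,  -1,    0))   # A vs. B
g <- factor(rep(c("A", "B", "C"), each = 10))
contrasts(g) <- my_contrasts
contrasts(g)
```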

For an excellent description of lots of different coding schemes – and all of the relevant code for using them in R! – see the contrast coding explanation from IDRE. Although beware: the contrasts() command is not as straightforward as the lovely folks at IDRE make it out to be. If you’re interested in trying this in R, be sure to also read my rpubs page on contrasts.

Also see this post by Nicole.

When should I do post hoc comparisons?

It depends on the motivation for the comparison. If the reason you want to compare two means is NOT based on what you saw in the data you collected (e.g. because your theory suggests they should be different), then that should be an a priori comparison. An example of this would be the contrasts we run in the factorial ANOVA example above – the polynomial trend contrasts are based on the assumption that low, medium, and high dosage (which is a naturally ordered categorical variable) will affect scores in an ordered way. On the other hand, if the reason you want to compare two means is because they look like they might be different (you look at the data, and then decide which comparisons to run based on what looks promising), then you should be doing post hoc tests. Post hoc tests feel a little different to run than a priori contrasts: generally, post hoc tests analyze ALL of the possible comparisons rather than picking out just a few. For example, Tukey’s HSD is a post hoc test. To run it, you correct your significance level (you make it more stringent, to correct for the fact that your Type 1 error rate would otherwise be inflated because you’re doing so many comparisons), and then check every pairwise comparison in the dataset to see which ones reach significance.
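For reference, here is what that looks like in R with simulated data: Tukey’s HSD tests every pairwise comparison while holding the family-wise error rate at your chosen α.

```r
# Hypothetical one-way design with 4 groups; group C has a real bump.
set.seed(8)
d <- data.frame(group = factor(rep(LETTERS[1:4], each = 15)),
                score = rnorm(60) + rep(c(0, 0, 0.8, 0), each = 15))

fit <- aov(score ~ group, data = d)
TukeyHSD(fit)   # all pairwise comparisons, family-wise alpha held at .05
```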

Also see this post by Nicole.