Fixed or Random?

Whether we consider a factor fixed or random is more about how the researcher thinks about the factor than about what the design actually looks like. Let’s say I run a study with 100 participants in 2 conditions. If I want to generalize my results to people other than the particular 100 people I ran in my study, then subjects is a random factor. But if the 100 people in my study are 100% of the population I’m interested in (e.g. I want to know how members of this psych department behave under condition A vs. condition B, and I run all of us in the experiment), then subjects is a fixed factor. It still looks the same on the surface (100 people in two conditions), but the way we conceptualize the design is different. Similarly, if you test 4 emotions, that factor could be fixed or random, depending on how you want to generalize your results. If these four emotions are the only ones you want to know about, then they’re fixed. If they’re a sample from a larger population of emotions, and you want to generalize to that population, then they’re random. The study looks the same on the surface in both cases (testing people under each of the four emotions included in your study), but the design and the EMS table change.
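
The distinction also shows up when you actually fit a model. Below is a minimal sketch in Python using statsmodels, under some assumptions I’m adding for illustration (each person is measured in both conditions, and the data live in a long-format table with hypothetical columns score, condition, and subject): treating subjects as random means letting a mixed model estimate subject-to-subject variability, while treating subjects as fixed means entering each person as an ordinary categorical predictor.

    # Minimal sketch (hypothetical data): the same 100-person study with subjects
    # treated as a random factor vs. a fixed factor. Assumes a long-format table
    # with columns score, condition, and subject, and that each person is
    # measured in both conditions.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_subjects, conditions = 100, ["A", "B"]
    data = pd.DataFrame({
        "subject": np.repeat(np.arange(n_subjects), len(conditions)),
        "condition": np.tile(conditions, n_subjects),
    })
    subject_effect = rng.normal(0, 1, n_subjects)        # person-to-person variability
    data["score"] = (subject_effect[data["subject"]]
                     + 0.5 * (data["condition"] == "B")  # condition effect
                     + rng.normal(0, 1, len(data)))      # residual noise

    # Subjects as a RANDOM factor: generalize beyond these particular 100 people.
    random_fit = smf.mixedlm("score ~ condition", data, groups=data["subject"]).fit()

    # Subjects as a FIXED factor: these 100 people ARE the population of interest.
    fixed_fit = smf.ols("score ~ condition + C(subject)", data).fit()

    print(random_fit.summary())                           # mixed-model output
    print("condition effect, subjects fixed:", fixed_fit.params["condition[T.B]"])

Either way the data table looks identical; what changes is whether the model treats subjects as a sample from a bigger population or as the whole population of interest.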

This has implications for replication attempts. If you wanted to replicate one of my example studies above, how you would replicate it depends on whether the factors are fixed or random. If you want to replicate my finding about how members of this psychology department perform under condition A vs. condition B, then you should use the exact same people I did (the members of this psychology department). But if you’re interested in generalizing the results to a larger population (academic psychologists in general, perhaps), then you could and should use a new sample from that population. Similarly, if you want to replicate my findings about how people behave under different emotions, whether you use the exact same 4 emotions I used depends on whether the study is actually about those 4 emotions and no others, or about emotions in general, with those 4 just serving as examples. If you want to generalize to a larger population of emotions, then you could and should draw a new sample of emotions to test.

All that matters is whether you want to generalize to some larger population or not.

Low Power

What does it mean to have low power?

A few important things to keep in mind:

  • Power is not the same as effect size. Effect size is relevant to a particular data set or population(s) – you collect the data, then you analyze that specific data set to see how big the effect is in that sample. You can hypothesize about the “true” population effect size based on your sample results, or based on theory or prior work. Power is relevant to a particular experimental design – the specifics of your experiment (sample size, the α you plan to use, and the effect size you hypothesize), but NOT the particular data set you end up collecting.
  • Power is a probability, just like β and α. In null hypothesis testing, there are two possible realities for any given test – either the null hypothesis really is true, or the null hypothesis is false. Because we can’t measure anything with perfect accuracy, though, there’s an element of randomness in our testing of that reality, which means that if the null hypothesis is true, there’s a chance we’ll correctly retain it, but there’s also a chance we’ll incorrectly reject it just because we happen to get a weird sample. Similarly, if the null hypothesis is actually wrong, there’s a chance we’ll reject it, but there’s also a chance we’ll incorrectly retain it. β is the probability of retaining the null hypothesis when you should reject it, i.e. getting p > .05 (not significant) when there really is an effect in the population. α is the probability of incorrectly rejecting the null when you should retain it, i.e. getting p < .05 (significant) when there really is NOT an effect in the population. Power is the probability of rejecting the null when that’s the correct thing to do, i.e. getting p < .05 (significant) when there really is an effect in the population. In other words, power = 1 – β.
  • What influences the probability of rejecting the null, assuming that it really is wrong (power)? Effect size, sample size, and whatever you set α to be (bigger effects and bigger samples mean more power; a more stringent α means lower power). Make sure you understand WHY each of these things affects power. Review Lab 6 for help, and see the simulation sketch right after this list.
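
One way to build that intuition is to simulate it. Here’s a minimal sketch in Python (a hypothetical two-sample design, not any particular study from this handout): when the true effect is zero, the rejection rate settles near α; when the true effect is real, the rejection rate is the power, and you can watch it move as you change the effect size, the sample size, or α.

    # Minimal simulation sketch (hypothetical two-sample design): estimate alpha
    # and power by running many simulated t-tests and counting rejections.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    def rejection_rate(true_d, n_per_group, alpha=0.05, n_sims=5000):
        """Fraction of simulated experiments that come out p < alpha."""
        rejections = 0
        for _ in range(n_sims):
            a = rng.normal(0.0, 1, n_per_group)      # control group
            b = rng.normal(true_d, 1, n_per_group)   # treatment group, shifted by d
            if stats.ttest_ind(a, b).pvalue < alpha:
                rejections += 1
        return rejections / n_sims

    # Null really true: the rejection rate hovers around alpha (Type 1 errors).
    print("d = 0.0, n = 20:", rejection_rate(0.0, 20))

    # Null really false: the rejection rate is the power.
    print("d = 0.5, n = 20:", rejection_rate(0.5, 20))                  # modest power
    print("d = 0.5, n = 64:", rejection_rate(0.5, 64))                  # bigger sample, more power
    print("d = 0.8, n = 20:", rejection_rate(0.8, 20))                  # bigger effect, more power
    print("d = 0.5, n = 20, alpha = .01:", rejection_rate(0.5, 20, alpha=0.01))  # stricter alpha, less power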

When you calculate power, you get a number between 0 and 1 because it’s a probability: 0 means you’ll never get a significant result, 1 means you definitely will. Remember that this number is based on the assumption that the effect is real (you enter a hypothesized effect size when you calculate power – if the null hypothesis were true, that effect size would need to be entered as zero). So let’s say you want to run a study with 20 participants, and you anticipate the effect size will be .5, so you calculate that you have a power of .36. That means that if you ran this study with N=20 an infinite number of times and the true effect size in the population really were .5, then 36% of the time you’d get a significant result. But real researchers don’t run the same study over and over an infinite number of times; they run it once. The practical implication of low power, as in the example here, is that if you run your study as proposed, it’s pretty unlikely you’ll get a significant result, even if the effect is real. As a rule of thumb, you shouldn’t run a study unless you have a power of at least 80%. Otherwise it’s not really worth your time.
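
If you want to see where numbers like these come from, here’s a minimal sketch using statsmodels’ power utilities. It assumes an independent-samples t-test with 20 people per group and a hypothesized effect size of d = .5 (assumptions I’m adding for illustration – the exact power depends on which design you assume, so it won’t match .36 exactly), and it also solves for the sample size needed to hit the 80% rule of thumb.

    # Minimal sketch: analytic power for an independent-samples t-test
    # (assumed design; the exact numbers depend on the design you pick).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Power with a hypothesized effect size of d = 0.5 and 20 people per group.
    power = analysis.power(effect_size=0.5, nobs1=20, alpha=0.05)
    print(f"Power with 20 per group: {power:.2f}")        # roughly .34

    # Sample size per group needed to reach the 80% rule of thumb.
    n_needed = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
    print(f"n per group for 80% power: {n_needed:.0f}")   # roughly 64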

 

If you know a test was under-powered, how does that change how you interpret a significant result?

So what if you know your power is low, but you run it and you get a significant result anyway? Or what if you don’t conduct power analyses until after you’ve already run the study and analyzed the data? If you had low power and got a significant result, one of two things happened:

  • This result reflects reality (i.e. the effect is real in the population), and you just got lucky with your random samples. Even though there was only a 36% chance of randomly getting samples that would produce a significant effect, you happened to fall into that 36%. The effect size you measure in your data set will almost certainly be an over-estimate of the true effect size in the population, because with low power the only samples that reach significance are the ones where noise happened to push the observed effect above its true size.
  • This is a Type 1 error. Usually, we assume that the probability of a Type 1 error is quite low (α = .05, i.e. about 1 out of every 20 tests run on a true null). Unfortunately, there are lots of things that can inflate your Type 1 error rate without you meaning to, and many of them are easy to do accidentally. Type 1 error rate can climb over 50% for some types of problems – the simulation sketch right after this list shows one common way that happens!
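
That “over 50%” figure is easy to reproduce with uncorrected multiple comparisons (discussed more below). Here’s a minimal simulation sketch (a hypothetical setup, not any specific study): run k tests on pure noise, call the study a success if any of them comes out significant, and watch the effective Type 1 error rate climb.

    # Minimal sketch: familywise Type 1 error rate when you run k tests on pure
    # noise and report the study as "significant" if ANY test has p < .05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def any_significant_rate(k_tests, n=20, alpha=0.05, n_sims=2000):
        """Fraction of null experiments where at least one of k tests is significant."""
        hits = 0
        for _ in range(n_sims):
            pvals = [stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
                     for _ in range(k_tests)]
            if min(pvals) < alpha:
                hits += 1
        return hits / n_sims

    for k in (1, 5, 14):
        print(f"{k:>2} uncorrected tests: effective Type 1 rate ~ {any_significant_rate(k):.2f}")
    # With independent tests this approaches 1 - (1 - .05)**k,
    # which passes 50% at around k = 14.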

When you get a significant result from a low-power study, you should feel a little uneasy. Basically, you’re faced with accepting either that one of two low-probability events occurred, or that there’s something going on that’s messing with your results (such as something inflating your Type 1 error rate). When low-powered studies successfully replicate, that’s especially concerning. Let’s say you ran two follow-up studies with a similar design, and replicated your original finding in each of them (or let’s say you’re reading an article where the author claims to have done this). The probability of getting three significant results from three studies with this design is .36*.36*.36 ≈ .047, or about 4.7% (compared to getting three Type 1 errors in a row when α is .05, which would be .05*.05*.05 = .000125, or about .01%). Are you willing to believe that happened? Probably not. What’s much more likely is that something is happening that’s inflating your Type 1 error rate, so it’s not really .05 but actually much higher. One very common thing that can do this to you is multiple comparisons (when you run lots of tests but then just report the significant ones, acting as though they were theoretically motivated). Another common mistake that drastically inflates Type 1 error rate is treating random effects as fixed – we’ll talk about this in 612. And, of course, some bad scientists just doctor their data. That leads to a pretty high Type 1 error rate, unsurprisingly.
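
If you want to check that arithmetic or play with it, here it is spelled out (the .50 value for an inflated α is a hypothetical number I’m using for illustration, not something measured from a real study):

    # The replication arithmetic from the paragraph above, spelled out.
    power = 0.36           # chance of a significant result if the effect is real
    alpha = 0.05           # nominal Type 1 error rate
    alpha_inflated = 0.50  # hypothetical inflated rate (e.g. from many uncorrected tests)

    print(f"three hits if the effect is real and power is .36: {power ** 3:.3f}")
    print(f"three Type 1 errors in a row at alpha = .05:       {alpha ** 3:.6f}")
    print(f"three 'hits' in a row if alpha is really about .5: {alpha_inflated ** 3:.3f}")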

The moral of the story is that if you see significant results from a low-power study design, you should evaluate those results very critically, whether it’s your own work or someone else’s report of their study. There’s always the risk that the Type 1 error rate was actually a lot higher than the nominal .05 the researcher reported, and that inflation is why the results reached significance.