Grading is ineffective, harmful, and unjust — let’s stop doing it

— by Amy J. Ko, Medium, March 16, 2019

This has been a fraught month of academic drama in my life. My daughter is anxiously awaiting her college admissions decisions, while she questions to what extent they’ll be based on merit. Next week is finals week here at the University of Washington, which basically means tens of thousands of students will face extreme exhaustion and anxiety. And the recent news of the college admissions bribery scandal exposed elite fraud behind high stakes exams for college admissions, creating a roiling discomfort in my academic communities about the underlying myths in our supposed meritocracies. And as I reflect on all of these high stakes decisions (who gets into college, who gets which jobs, and how much these decisions depend on people’s performance on exams), I can’t help but reflect on my lifelong distaste for society’s obsession with grading.

I think grading is terrible. But it’s not just because society is obsessed with it or that it causes stress. There are much deeper, fundamental reasons why I think not only exams, but all forms of summative assessment, are destructive, ineffective, and highly problematic from an equity perspective. And as we’re now seeing with the extreme lengths that parents go to in order to secure coveted positions in elite schools, grading is also warping society in a way that I believe is masking the real purpose of school, which is to learn, to grow, and to discover one’s self. And so, in this essay, I’d like to deconstruct the core flaws of grading, and sketch a vision for a different path for society to allocate resources like admissions and hiring.

First, let me distinguish between formative and summative assessment. Formative assessment is diagnostic: it helps learners and teachers understand what learners do and don’t know, can and can’t do, to plan future learning. Formative assessment is well known to be a powerful driver of learning when done well, because it helps both teachers and learners know the learners’ strengths and weaknesses, and decide what to do next to improve on them. Summative assessment, on the other hand, is intended to be a formal measurement of knowledge for the purpose of making decisions. We grade to help schools make admissions and graduation decisions, to help employers make hiring decisions, and to help people like scholarship donors and loan grantors interpret merit and academic progress.

So why is summative assessment so bad relative to formative assessments? Let me share the ways.

First, we (teachers at all levels, as well as hiring managers and recruiters trying to make decisions about individuals) do summative assessment poorly. Really poorly. Designing reliable, valid assessments of any knowledge is exceptionally difficult. Organizations like the College Board, which constructs assessments like the SAT and AP Exams, have an entire staff of people carefully measuring the psychometric properties of exams over years and across massive populations of learners. Many of these staff members are exceptionally well-trained experts with Ph.D.s in educational measurement (many of whom graduate from UW’s College of Education). And despite their careful work, assessments still have profound limitations on their validity and use. For instance, getting a “4” on the AP CS A exam really only means one thing: a student likely knows the content of a typical introductory programming course in a higher education CS class. That’s all it means. And yet, just like with grades, most people just see it as a number and read whatever meaning they want into it, such as the student’s future abilities as a software engineer, their likely success in a CS program, or their intelligence. It means none of these things.

Now consider the vastly less carefully constructed assessments that K-12 teachers, higher education faculty like myself, and recruiters create. Teachers might spend a weekend writing a midterm, a few hours writing a quiz, and most of us aren’t even trained to construct reliable, equitable, valid assessments of knowledge. What are the chances our summative assessments have any meaning? Pretty low. I have even less confidence that recruiters and hiring managers can make reasonable judgements of ability. Consider, for example, the still ubiquitous practice of using coding puzzles in software engineering interviews. Everyone does it, even though they have a strong intuition that it doesn’t really predict anything. And their intuition is right, at least at Google.

Suppose all of these groups trying to measure students’ abilities got outstanding training in designing summative assessments, and had infinite time to create great assessments. Even if this were the case (which I’m confident will never be), summative assessments themselves are used by many different groups in ways that are inappropriate, unproductive, and even harmful to students and society.

Consider, for example, how students use grades. They see them as a reflection of their abilities, and sometimes more deeply as a reflection of their ability to learn at all. Grades shape their self-efficacy in particular skills (“I’m not good at math”), but they also shape learners’ beliefs about their intelligence (“I’m not good at learning”), and eventually, they can begin to shape their identities (“I’m not a very smart person”). And as we know from decades of research on learning, all of these beliefs shape students’ future success at learning, all but guaranteeing that the first grades students see in their early days at school will determine much of their future performance. These are, of course, entirely inappropriate uses of grades. None of the measures that teachers construct or recruiters use to make hiring decisions mean any of these things, but students interpret them that way anyway. This is a real form of harm.

K-12 schools, colleges, and universities also use grades inappropriately. The very notion of a report card or a transcript as an archival record of learning is entirely counter to what we know about how knowledge develops over time. First of all, knowledge develops. It changes, shifts, expands, contracts, and decays. The fact that I got a B+ in government in high school says nothing about my interest or knowledge in politics now 23 years later — I’m so obsessed with government and politics, I could probably teach a great high school class on government! All that grade really says is I had a chatty friend who sat next to me in class and a bad teacher. And that same B+ my friend got also means nothing, because his lifelong disinterest in government and politics probably means he’s forgotten most of what he learned. But academic archiving of grades as a record of knowledge signals meanings that aren’t there: they convey a sense of one’s abilities now and they suggest a validity and credibility of measurement that is usually not the case. Students know that their transcripts aren’t accurate — they write in elaborate detail in admissions applications about the extenuating circumstances that shaped their grades, and the later learning that makes them invalid representations of their knowledge.

Schools also do statistically ridiculous things with grades. For example, grade point averages take a bunch of aggregate measures of learning (every homework, test, quiz, etc.) and aggregate them even further, as if any of those aggregate measures had any shared meaning. Remember units in math? How meaningful is it to add a bunch of quizzes that measured entirely different things into a big sum, then average that sum with a bunch of other sums that have different units? It’s not really meaningful at all.

If anything, grade point averages mostly measure how well a student can do “school,” and how many resources the student has to successfully perform these school skills. Is ability to “do school” a good predictor of anything? It’s a good predictor of salary if you’re white and wealthy, but not if you’re a racial minority, which says a lot more about racism than it does about the predictive power of a grade point average.

Finally, in addition to the poor practices we use to determine grades, and the invalid and inappropriate ways that teachers, schools, employers, and students use grades, grades are also a form of structural barrier to success in society. They mask, and sometimes erase, entire lifetimes of inequities that end up explaining most of the variation in grades. Let’s enumerate a few of the factors that grades inadvertently measure other than a learner’s knowledge:

Grades reflect how much students sleep, because sleep determines students’ ability to attend to learning and perform during tests.
Grades reflect the extent to which students are getting enough food and enough healthy food, because that determines their attention.
Grades reflect students’ interest in a topic, because interest determines motivation to perform well on summative assessments.
Grades reflect students’ belief in their ability to learn a skill, which is shaped by culture, gender, and racial norms, as well as socioeconomic status and the social networks that come with it.
Grades reflect students’ ability to manage their time, which is not something that all students have the opportunity to learn or the same capacity to learn, due to neurodiversity in executive function.
Grades reflect students’ ability to perform in high stress testing contexts, which not all students can do equally because of test anxiety.
Grades reflect students’ language fluency, as fluency is key to comprehending prompts on many assessments.
Grades reflect students’ cultural knowledge, as assessments often require such knowledge to comprehend a question.
Grades often reflect students’ ability to attend class on time, especially in higher education, where many students must find ways to commute reliably in inherently unreliable cars and public transit.
Grades reflect how much time students have to devote to learning, which itself is a product of how much time they spend commuting, managing their health and wellness, caring for family members, caring for children, and managing external forces of conflict and chaos, such as violence, trauma, and domestic abuse.

None of these underlying factors in shaping grades have anything to do with learners’ abilities, and yet we pretend as if none of these factors are at play when we interpret grades. This is despite the fact that when we were in school, we were intimately aware that the grades we received were mostly a reflection of all of the above, and not our knowledge. If we set out to measure these and the many other factors in people’s lives that shape learning, I suspect we’d find that high grades on well-designed assessments are reasonable signals of student knowledge, but lower grades are just a measure of all of these other factors (and don’t necessarily mean that a student lacks the knowledge).

If we could ensure none of these factors were at play in our assessments of student knowledge, and we could ensure that the world used grades in appropriate ways, and we could ensure that our assessments were actually reliable and internally valid, then I might be okay with society being obsessed with grades. But that seems a bit utopian to me, and for a problem that I’m not sure is actually that important to solve.

So let’s talk about the problems that summative assessments seek to actually solve. Here are a few things they help with, and why they don’t help that much:

Grades are an incentive. Indeed they are. I use grades to extrinsically motivate students to learn. And yet they are an incentive that every teacher knows warps learning; students become more concerned with maximizing a number than with the learning itself. Is the incentive really worth it, especially when we could be using other more powerful incentives for learning, such as students’ intrinsic interests and life goals? I don’t think so. When I design classes, I try to design assessments that are closely aligned with student interests and goals to avoid this warping of motivations, but the effect is still there.
Grades accelerate merit decisions. Some might complain that without grades, trying to decide who gets into college, who gets an internship, who gets a scholarship, or who gets a job, would take too long. But is that really true? I run an admissions process that doesn’t use GPA at all, and relies on grades for only one of many criteria, and we manage to get it done. Sure, we can’t automatically make admissions decisions through a formula, but for the reasons noted above, that would be highly unjust anyway.

So if grades are just a poor motivator and a problematic decision accelerator, what should we use to assess students’ knowledge instead of grades? Well, outside of schools, there are many other indicators that we already use in society that might be reasonable predictors of future performance. Brief semi-structured interviews can be a great way to assess particular skills (Harvard does this for nearly all their applicants using their massive alumni population). Portfolios can be a rich way to communicate one’s history of work (these are ubiquitous in art and design professions). Students’ writing about themselves can reveal interests, ideas, ambitions, and visions (these are already a central part of admissions processes). Recommendations can reveal someone’s experiences with a learner over time. Of course, all of these other indicators of knowledge and ability have their own problems around bias, but at least they don’t come with the semblance of objectivity that a number or letter grade comes with; that creates space for conversations about how to avoid bias. And there are likely many better ideas we haven’t invented because we’ve been so obsessed with grades as a primary indicator of ability.

Inside of schools, I don’t see a role for summative assessment at all. Why not just exclusively focus on formative assessment, which we know can strongly support learning and help students develop a growth mindset toward their learning? When it comes time for students to demonstrate their knowledge to organizations outside of schools, schools can help them create the portfolios, prepare for interviews, and do their writing (much like schools already do). And if teachers didn’t have to spend so much time on summative assessment, they could spend a lot more time supporting their students on these things.

All that said, I’m not even sure merit is a reasonable concept to begin with. On the latest episode of On the Media, there was a nice segment about the origins of the word “meritocracy” and how it was originally meant satirically, as a framing of a merit-based world as just another form of aristocracy. The interview was unpacking the idea that in our efforts to try to distinguish between individuals for the purposes of allocating fixed resources, we’ve forgotten that some resources aren’t actually fixed. If the United States wanted to, everyone could go to college, as is the case in most of the rest of the developed world. If we wanted to, everyone could have an outstanding public education, as is true in many European and Asian countries. If we wanted to, everyone could get internships in a career of interest, if we created incentives for the 28 million small businesses in America to invite students to their teams to learn. There’s just weak political will to support everyone in the U.S. right now, so we default to the idea that there will be winners and losers and we’ll choose them based on “merit.”

Of course, it’s easy to imagine that some resources are fixed. If the University of Washington admitted everyone, we’d have to find ways of teaching hundreds of thousands of learners, housing hundreds of thousands of learners, and finding offices for ten times as many faculty. We’d need ten times as much space to do this, which would either mean building on top of our lakes or building really high. And our city would have to absorb all of the new demand for food and services generated by this massive new population of learners, dramatically changing the city’s culture. All of this is possible, of course; we just don’t want to do it. The real fixed resources are our ability to dream, our desire for equity, and our capacity for change.

I’m not the first person to make these critiques of grading. Other universities have already moved beyond grades. There are countless research papers and vast bodies of knowledge in education research enumerating flaws with summative assessments (including screeds just like mine from two decades ago). And most teachers I’ve met really have no love for grading either. We’ve created a system that doesn’t work, that no one wants, and that’s propped up ultimately by a desire for fairness that is really a desire for everyone having access to what they need to thrive. So let’s give up on the failed idea of grades, and the failed idea of zero sum games, and start working on better ideas for giving every learner what they need to succeed.
