Notable academic Twitter person Terry McGlynn recently wrote a blog post arguing for changes to the structure of student ratings of teaching (SRTs) that would make them more useful. While I entirely agree with the premise that SRTs as currently constructed aren’t particularly useful, I’m not sure I’m on board with the rest of the argument. In particular, I got to wondering along two parallel tracks:
- Do I believe that Terry’s proposed changes will produce a more useful set of data for the evaluation of teaching effectiveness?
- If indeed I do believe that this data is more useful, useful for who?
Some background and positioning
It’s important to me to start with some background here so that it’s clear who I am and where I’m coming from. I’m relatively new faculty — this is my fifth year out of my Ph.D. in mathematics education — in my first year at a new job in the math department at Westminster College. (Standard disclaimer that this blog is my opinion and not theirs etc. etc.) I’m an out gay man, and I think it’s important for my students to see examples of successful Gays In STEM™, so I don’t hesitate to talk about my husband in any situation where a spouse might come up in conversation.
I had a particularly bad experience with SRTs two years ago, about which I will probably blog eventually, but suffice it to say that I experienced a significant instance of homophobic bias. This caused me to dig into the (substantial!) body of research on SRTs; I’ve not yet found much redeeming about them. We know that quantitative scores exhibit statistically significant biases along the lines of race and gender, but also along disciplinary lines and lines of seniority. What’s more, we also know that SRTs are not correlated with measures of student learning, but that they are correlated with the availability of cookies. Universities and organizations including Ryerson, Oregon, USC, and even the AAUP have come to the conclusion that SRTs can’t legally be used in tenure and promotion decisions (because if a professor from an underrepresented background were denied tenure on the basis of SRTs, they could quite reasonably sue the pants off their university for using racially-biased data). In short, student ratings of teaching are a dumpster fire.
You may have noticed by now that I’m not using the more common term “student evaluations of teaching,” and maybe you’re wondering if that’s on purpose. Reader, it is: calling these things “evaluations” ascribes to students particular skills and bodies of knowledge that they by definition don’t have. Students are not trained evaluators of teaching, and so we can’t reasonably call any data they produce evaluations. However, I think it’s fair to call this data ratings or feedback.
I’m not asking for the complete abolition of SRTs; I do think it’s important to give students a voice. However, I’m asking for us to think harder about what this data is and isn’t good for, what we can and can’t meaningfully conclude from it, and whether we can come up with a better way of hearing student voices, for some meaning of the word “better”.
Will Terry’s changes produce more useful data?
So that leads naturally to my first question. Certainly, one way we could operationalize “better” is as meaning “more useful.” Terry argues that a particular set of changes would produce useful SRTs. Let’s take a look.
Terry proposes that we should ask students “unambiguous questions that reflect explicit performance criteria.” For instance, we might ask, “Was the instructor late to class on a regular basis?” or “Did the instructor use disparaging language about a student in the class?” or “Was your instructor present at posted office hours?” These kinds of questions do seem like an improvement. With apologies to John Hodgman, specificity is the soul of data, and at least these questions are way more specific than the kind of open-ended “how did this class go” questions that too often populate SRTs.
My question, though, is whether these questions are specific about the right things. Terry is arguing for questions assessing whether “instructors are meeting baseline performance criteria.” Which, sure. Let’s detect the “derelict tenured professor who fails to do their job at the minimum level expected of them.” But how useful is this really? How many “derelict” people are we truly going to detect? And for (I’m willing to venture) the vast majority of instructors who at least minimally care about their teaching responsibilities, what do we learn from this data?
Useful for who?
It doesn’t seem to me like these questions are useful for anybody but administrators looking to detect the worst offenders. Without getting too Marxist about this: as labor, I’m in general not stoked about handing more punitive tools to capital. I don’t know what kind of positive changes most instructors can make based on the responses to questions like the ones Terry is suggesting.
Can’t we make something that’s positive? Something that’s useful to instructors? For the vast majority of us who at least minimally care about our teaching responsibilities, can’t we make something that helps us actually improve our teaching skills?
A lot of people argue that SRTs as presently constituted do help them improve. When I talk to people about how SRTs are a dumpster fire, I often hear this: “Well, I’ve really learned a lot from reading my SRTs. I used to do [X] but then students gave me the idea to do [Y] instead and I started doing it and it went really well!” Which, fine, this happens. I’ve gotten good ideas from SRTs myself. But here’s a question I can’t help but ask whenever I hear this story: is [Y] a good thing to do, or is it just a popular thing to do? Was [X] a bad thing to do, or was it just unpopular? Too often, students’ perceptions of what’s good for their learning are diametrically opposed to what science tells us is actually good for their learning (another reason we should be deeply skeptical of student-produced data on teaching effectiveness).
What if, instead, we asked students about the incidence of specific evidence-based practices? Or, heck, since we’re skeptical about students as evaluators for all the very good reasons discussed above, why don’t we ask instructors themselves?
This is precisely the approach taken by the fine people at the Carl Wieman Science Education Initiative, who have developed the excellent Teaching Practices Inventory. This is a shortish (10-15 minutes) self-reflection that instructors can complete at the end of the term. It’s a structured way for instructors to think hard about what they do in their classrooms, and to think forward about how they can incorporate more evidence-based practices in their teaching. Speaking for myself, I’ve learned way more, and made way more substantial changes to my teaching, from this kind of structured reflection than I ever will from students complaining about the same three good-for-them-but-unpopular things on my SRTs for the rest of eternity.
Don’t worry, we’re not going to leave the students out of the fun. CWSEI researchers have also developed really good student surveys that get students to report the incidence of evidence-based practices, and thus also just maybe get students to see that there’s a gulf between evidence-based practices and their preconceptions about good education. And students’ perceptions of their learning experiences are important data: if instructors think they’re implementing evidence-based practices, but students don’t see them, then that’s a good signal that the instructor should rethink their implementation.
I don’t think these instruments are silver bullets. I fully expect that students’ implicit biases will continue to manifest in any quantitative instrument we ask them to fill out, and so we really need to keep thinking hard about how to incorporate this data into more holistic evaluation of teaching effectiveness. And self-report instruments are of course subject to people manipulating them, or not taking them seriously. But, dang, isn’t this at least a good start? We can create a system of student ratings of teaching that is positive, and that helps instructors teach better. And if that’s not manifestly the purpose of SRTs, then what (or who) are they actually for?