Introduction
Teaching is a core part of the mission of institutions of higher education. It will be evaluated; the question is how (Knapper, 2001).
Student evaluations of teaching (SETs) are now ubiquitous in universities. Ohio University invests significant time, effort, and money in administering SETs, transmitting the data to faculty, and then analyzing and using them for both summative and formative evaluation of faculty members. [1] Our Faculty Learning Community sought to understand the ways in which this machinery is, and is not, serving its presumed goal of improving teaching at our institution. Based on selected readings from the copious research literature on SETs, we concluded that useful things can be learned from SETs, but that they should be employed for the evaluation of teaching only in the context of other material, and that the information in SETs must be collected and interpreted with some care.
Our findings on SETs are informed by an understanding of teaching as a multi-dimensional activity. Effective teaching involves comprehensive subject knowledge, proper preparation and organization, intelligent design of course materials, stimulation of student interest, and many other aspects. Using an "overall score" as a measure of teaching effectiveness attempts to represent something in this high-dimensional space as a single number. This is highly reductive. In the best case, the number is useful because the different aspects of effective teaching are weighted in a carefully chosen way; in the worst case, it has nothing whatsoever to do with good teaching.
Two particular concerns, discussed further below, deserve mention here. First, while students are well-positioned to evaluate their own experience in a class, there are simply subjects on which they are not qualified to opine, e.g., the details of the instructor's subject expertise. Second, numerical SET scores are frequently interpreted as more precise than they really are. Even if the SET score in a particular category is meaningful, measurement uncertainty means that a score of 4.7, say, is not demonstrably superior to a score of 4.5. Excessive concentration on the scores obtained by faculty in SETs therefore produces a system that focuses on noise in the SET data. This problem is amplified by research showing that bias (as well as other factors) can affect SET scores at a level markedly larger than the 0.2 difference in this example.
These, then, are the broad philosophical underpinnings of our report: SETs are here to stay, and we want a system in which they are used for meaningful improvement and assessment of the multi-dimensional activity that teaching is, not for mindless ranking of faculty. The accompanying one-page document summarizes the recommendations that we think will help our college achieve that end. In this report we give further rationale for these recommendations, including references to some relevant literature. We first describe the process by which the FLC investigated these issues, then present our key findings regarding how SET data should (and should not) be used, discuss how more, and better, SET data can be collected, and conclude by describing strategies that complement SETs and provide a fuller picture of teaching effectiveness.
FLC process
Daniel Phillips (Physics and Astronomy) and Laurie Hatch (Associate Dean; Sociology and Anthropology) coordinated this Faculty Learning Community, which also included seven other faculty members from a range of departments in the College of Arts & Sciences, as well as Tim Vickers, the Director of the Center for Teaching, Learning, and Assessment. The group met eight times during the 2016-17 academic year to discuss various aspects of SETs, and presented our findings to the college in April 2017.
First, we discussed our departments' current practices: how SETs are administered, how they are used by the P&T and yearly Peer Evaluation Committees, and how they are perceived informally. Next, after reading relevant research, we discussed the influence of various factors on SETs, such as the students' class ranks, whether the course was required or elective, and whether the class was perceived as difficult. The following meeting focused on how different aspects of the instructor's identity (gender, race, ethnicity, foreign accent, age, etc.) and personality affected how the students evaluated their teaching. We also examined the context in which SETs are administered (whether they are filled out during class, whether students are given incentives to provide comments) in order to discern the best approach to achieving a high response rate without distorting the students' responses.
In January, Tim Vickers conducted a focus group with seven undergraduates who were recruited by FLC members. He asked them a number of questions regarding their perception of and responses to SETs. The discussion lasted about 90 minutes, and a transcript is available upon request. The focus group was undoubtedly not representative, but we found the students' misconceptions about SETs to be surprising and important information. We then discussed research that recommends a broader approach to evaluating teaching and how these methods (such as supplementing SET scores with something like a teaching portfolio) could be used by departments at Ohio University. Our final meeting concentrated on distilling what we had learned from our previous meetings, in order to offer advice for improved use of SETs at the faculty, department, and college levels.
After discussing the research and our own experiences, we agreed on two basic points. First, SETs offer only a limited (and sometimes flawed) perspective on teaching. They should not be the sole source of information regarding a faculty member's teaching performance. Second, SETs should be administered and analyzed carefully, in order to obtain, in the best way possible, the information that they are able to offer.
Better Practices Regarding Use of Data from SETs
Every SET yields an "overall score" for the instructor in a particular class. In Class Climate this is the "Global Index," obtained by averaging the scores for the four questions in the "Instructor Evaluation." The simplest way to use SETs for summative evaluation of teaching is to take this single number as representative of the instructor's effectiveness in the class. But current research implies that the overall score has no correlation with teaching effectiveness. There is a standard research methodology on the question of whether such a correlation exists, based on multiple sections of a large class with a common exam. Early claims were that there is a positive correlation between the overall score an instructor receives and the average grade for that instructor's section (Cohen, 1981; Marsh, 2007). However, a recent meta-analysis finds that, averaging over all such studies while correcting for the effects of study size, there is no correlation (Uttl, White, and Gonzalez, 2016). Using an average measure such as the Global Index to evaluate teaching effectiveness is clear, simple, and wrong. We therefore recommend that departments focus on scores for individual questions and written comments, rather than on the overall score or, indeed, any one SET number.
Nevertheless, we also think there is useful information in the Class Climate Likert-scale questions that go into this average. Those questions tell us students' perceptions of instructor clarity, availability, and in-class environment. Each represents a particular aspect of the faculty member's teaching persona. But even when examining one of these dimensions of teaching (e.g., class environment), blind ranking of faculty by their scores is a mistake. For one thing, those scores are distributed over a range, presumably representing natural variability in students' experience. It may not be statistically sensible to say that Professor A is doing better than Professor B at "creating an environment conducive to learning" if their scores for that question differ by only a few tenths of a point, especially if the sample sizes are small.
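As a rough illustration of this point, the short Python sketch below compares two instructors' mean scores on a five-point Likert question. The response counts and standard deviation used here are assumptions chosen for the sake of the example, not values drawn from our SET data; the point is only that, with class sizes typical of many courses, a gap of a few tenths of a point is comparable to the statistical uncertainty in that gap.

import math

# Hypothetical example: two instructors' mean scores on a 5-point
# Likert question. The response counts and standard deviations are
# assumptions for illustration, not values from our SET data.
mean_a, mean_b = 4.7, 4.5      # observed class means
n_a, n_b = 25, 25              # assumed number of student responses
sd_a, sd_b = 0.8, 0.8          # assumed spread of individual ratings

# Standard error of each class mean, and of the difference between them
se_a = sd_a / math.sqrt(n_a)
se_b = sd_b / math.sqrt(n_b)
se_diff = math.sqrt(se_a**2 + se_b**2)

print(f"Observed difference:           {mean_a - mean_b:.2f}")
print(f"Std. error of that difference: {se_diff:.2f}")
# With these assumptions the 0.2-point gap is smaller than its own
# standard error (about 0.23), so it is not evidence that Professor A
# is more effective than Professor B.

Larger response counts shrink this statistical uncertainty, but the systematic effects discussed next do not average away in the same fashion.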
These concerns intensify upon consideration of systematic effects in the SET score. Some faculty perceptions regarding such effects are myths: for example, increased workload, rather than leading to a decrease in SET scores, in fact tends to produce slightly higher scores, at least up to the point where assignments become perceived as "busy work" (Marsh, 2007). But class size, whether the class is required or elective, etc., all have, to varying degrees, a discernible influence on the overall score (Marsh, 2007), so they must influence the scores for individual questions too. This means that interpreting the scores requires considering external variables (e.g., required vs. elective class) over which the instructor has no control.
We are particularly troubled by the effect that gender, racial, or other bias has on SET scores and comments. A clever study of an online class by MacNell et al. (MacNell, Driscoll, and Hunt, 2015) demonstrated that instructors perceived to be female are rated lower than their male counterparts. In that study, the male and female instructors of two different online sections switched gender identities, carefully controlling their assignments, length of comments, etc., so that they taught the two sections in the same way, insofar as is possible. Nevertheless, the instructor students thought was female (who was actually male) received systematically lower SET scores: a particularly galling example is the 3.55 vs. 4.35 score (out of 5) for prompt return of work, when the amount of time taken to return work was exactly the same. Similar disparities were seen in some other questions. The fact that such bias exists provides additional reason not to use SET scores blindly to evaluate faculty teaching. Particular care is therefore necessary with evaluations of women, minorities, and non-native speakers. The literature extensively documents that student biases often punish faculty for "norm violations" (i.e., not matching expectations of how "a professor" should appear).
One possible conclusion, given these problems with SETs, is that there is nothing to learn from them and we should ignore them. Aside from the fact that this is not a politically sustainable position, we concluded that there is good information in SETs. They represent students' own experience of the class. Naturally, this experience is filtered through each student's own lens, and some effort may be necessary to deconvolve the way that lens distorts their evaluation of teaching. It is sensible neither to trust SETs absolutely nor to throw them out completely.
The comments of the focus group we convened showed that students can be very thoughtful about SETs; they often strive to provide useful comments to professors, especially if they perceive that what they write will be taken seriously. The anecdotal experience of FLC members is that we frequently find, in the written comments on SETs, useful information that improves our teaching. We note that "Class Climate" (or an equivalent system) would be more useful if students' numerical scores and written comments could be grouped together. Minimally, this would allow identification of cases in which students used the wrong end of the Likert scale. It would also aid interpretation of Class Climate's Likert-scale scores; for example, when a student writes that the professor always returned work on time and provided useful feedback, but then gives a score of 4 for those questions, the comment clarifies how that student is using the scale. More positively, the correlations between written comments and numerical scores would provide a fuller picture of what a particular student did or did not appreciate in the class.
Because reading SETs in particular, and evaluating teaching in general, is complicated work, we recommend that departments form teaching evaluation committees to perform that segment of annual evaluations. Care is needed to tease real information on teaching effectiveness out of SETs. We note that some departments already have separate committees to perform annual evaluations of teaching, research, and service. A committee with this narrower remit is better able to dig into the SETs in detail.
We also recommend that the college de-emphasize the "overall score" in its P&T process. We find particularly pernicious the table that requires those going up for P&T to report their overall score in each class and compare it with the departmental average "in classes of a similar type." If the overall score does not correlate with effective teaching, then what purpose does this serve? Instead, just as in the departmental case, discussion of all aspects of SETs, including qualitative comments, will provide a fuller picture, one that should be complemented by the other teaching materials submitted in the dossier. We therefore think the college should re-evaluate the use of the overall score in P&T dossiers.
Obtaining Better SET Data
When designed, administered, and analyzed carefully, SETs can provide reliable information on the students' experience in class. As a group we discussed two aspects of SET administration: response rate and ways to get more information from the survey.
Over the past couple of decades, SET surveys have migrated from being administered in class to being administered online. Research implies that the mean teaching evaluation for online SETs does not differ from that for paper-based evaluations (Ling et al., 2012), in contrast to the perception of many faculty. Students who respond online also write more in the open-ended comments than is the case with paper evaluations (Bruce). And online SETs can be administered and stored more cheaply and efficiently than paper ones.
However, a survey produces reliable data only if a representative portion of the group provides input, and response rates for online SETs are a concern. At the beginning of the online transition, many universities reported that the SET response rate dropped from 80 percent to 50 percent. At Ohio University, the College of Arts & Sciences response rate was 44 percent in Fall 2012, the first semester our college used online evaluation. Response rates in the college have varied between 52 and 60 percent over the past two years.
In a paper published in 2012, R. Berk listed the "Top 20 Strategies to Increase the Online Response Rates of Student Rating Scales." Half of these strategies concern improving the technical aspects of the online survey itself; the rest are incentives to urge students to fill in the survey. Some of these incentives involve quid pro quos for filling out the survey: being entered into a drawing to win a prize, receiving extra credit, or the instructor providing food on the day evaluations are performed. Most of these are ethically problematic, and many play into the commodification of higher education by encouraging students to think of their SET as a customer satisfaction survey (Berk, 2012). We recommend that faculty members not use incentives (food, extra credit) to get students to complete evaluations.
Instead, several effective strategies for improving response rates emerged from the research and our focus group. These involve convincing students that their input is anonymous and important. Students' motivation to complete SETs is markedly higher if they expect they can provide meaningful feedback (Giesey, Chen, and Hoshower, 2004), but they worry about a "perceived lack of anonymity" (Berk, 2012). Students in our focus group were indeed unclear about who reads their evaluations, who manages the SETs, and what they are used for. Exhortation and reminders by both the institution and individual professors can help increase SET response rates, so we recommend that faculty members use a script or slide to explain the purposes of SETs, emphasizing the role of written comments in changing methods of instruction. In addition, a good way to prove to students that their input is valued is to set aside class time, if possible, for the completion of evaluations. These measures should improve response rates and give us the advantages of SETs administered online.
In the longer term, the college could consider studying the validity and reliability of its SET instrument. We decided that such issues of survey design were beyond the scope of our FLC, but it may be worth investing time in them in the future if our institution is to get the most out of all the effort it puts into SET administration, data collection, and interpretation.
Instituting Better Practices for Teaching Evaluation
SETs give only a limited vision of a teacher's process and expertise. Recent research suggests that teaching portfolios give a better window into teachers' performance in the classroom, while aggregating sufficiently diverse sources of information to offset potential biases (Knapper and Wright, 2001). SETs would be included in a teaching portfolio, but it would also contain a range of other materials, such as a teaching philosophy, peer teaching evaluations, and supplemental course materials (Knapper and Wright, 2001; Seldin, 2004).
The usefulness of teaching portfolios to committees that perform evaluation is well documented. However, they are also useful to the faculty members who create them (Seldin, 2004). They can be used, for example, when applying for grants or teaching awards, in tenure and post-tenure reviews, etc. (Seldin, 2004). Moreover, they "leave a written legacy within the department or institution so that future generations of teachers who will be taking over the courses of about-to-retire professors will have the benefit of their thinking and experience" (Seldin, 2004). That is to say, teaching portfolios can be useful tools in mentoring junior faculty. Portfolios allow for both summative and formative feedback at different stages of a professor's career: although summative feedback might seem the most prevalent use of the materials (i.e., for tenure and post-tenure reviews), they can also be used for reflection by professors thinking critically about their teaching (Knapper and Wright, 2001). Our group noticed that the materials faculty provide to the college P&T committee when they go up for promotion and/or tenure approximate a teaching portfolio. We think departments should use more than SETs in faculty evaluation, both annually and for P&T. A wider range of teaching evaluation materials, e.g., a subset of those used for P&T dossiers, allows for a fuller picture of a faculty member's teaching.
One key reason that such a system offers a useful heuristic for evaluating teaching is that implementing it requires a departmental conversation about what constitutes effective instruction. We recommend that departments facilitate faculty discussions concerning definitions of and goals for good teaching. How, though, do we begin to structure this conversation? Again, the literature provides a useful set of tools for thinking through how departments might do this. One important avenue of consideration is the alignment of learning outcomes with methods of teaching and assessment (Knapper, 2001). Teaching portfolios offer particularly useful data for articulating and understanding such alignment.
Lastly, we suggest that the college create more opportunities for discussions of what constitutes effective teaching, including further workshops, faculty learning communities, etc. This will help create a culture in which the complexity of teaching is valued and SETs are appreciated for the insights they can offer. In understanding our teaching in a more holistic context, one that takes into account more than just the numbers obtained in SETs, we can continue to put front and center the most important goals of the university. As Knapper argues, "the whole process helps stimulate debate about the university's central mission: teaching and learning" (Knapper, 2001: 5). Encouraging a culture of teaching excellence requires opportunities for such conversations not only within departments, but across the college. By creating these opportunities, we also create the space to mentor junior faculty and support senior faculty as they hone their teaching craft.
[1] Summative evaluation is the kind we conduct every year to determine faculty raises, resolve P&T cases, etc. It seeks a judgment about the effectiveness with which the task being evaluated was completed. Formative evaluation has a different goal: it aims to help the person being evaluated identify things they are doing well and areas for improvement. The two kinds of evaluation are clearly connected, and the same materials can be used for both.
References
R. W. Bruce, An Online Evaluation Pilot Study in the College of Arts, Humanities, and Social Sciences, Humboldt State University (unpublished)
P. A. Cohen, Review of Educational Research (1981), 51(3), 281-309.
J. Giesey, Y. Chen, L. B. Hoshower, J. Eng. Educ. (2004), 93, 303-312.
C. Knapper, New Directions for Teaching and Learning (2001), 88:3.
C. Knapper and W. A. Wright, New Directions for Teaching and Learning (2001), 88:19.
T. Ling, J. Phillips, S. Weihrich, J. Acad. Bus. Ed. (2012), 12, 150-161.
L. MacNell, A. Driscoll, and A. N. Hunt, Innovative Higher Education (2015), 40:291-303.
H. W. Marsh, in R. P. Perry and J. C. Smart (eds.), The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective (Springer, 2007), pp. 319-383.
P. Seldin, The Teaching Portfolio (2004).
B. Uttl, C. A. White, and D. W. Gonzalez, Studies in Educational Evaluation (2016), in press, http://doi.org/10.1016/j.stueduc.2016.08.007.
Members of the Faculty Learning Community on Evaluating Student Evaluations of Teaching
Claudia Gonzalez-Vallejo (Psychology)
Dan Hembree (Geological Sciences)
Mary Kate Hurley (English)
Jaclyn Maxwell (History/CWR)
Daniel Phillips (Physics & Astronomy)
Julie Roche (Physics & Astronomy)
Rose Rossiter (Economics)
Nancy Tatarek (Sociology & Anthropology)
Laurie Hatch (College of Arts & Science Dean's Office, Sociology & Anthropology) and Tim Vickers (Center for Teaching, Learning, and Assessment) also participated in the Faculty Learning Community.