Empirical Education is excited to share that we recently completed two Investing in Innovation (i3) (now EIR) evaluations, one of the Making Sense of SCIENCE program and one of the Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) program. We thank the staff on both programs for their fantastic partnership. We also acknowledge Anne Wolf, our i3 technical assistance liaison from Abt Associates, as well as our Technical Working Group members on the Making Sense of SCIENCE project (Anne Chamberlain, Angela DeBarger, Heather Hill, Ellen Kisker, James Pellegrino, Rich Shavelson, Guillermo Solano-Flores, Steve Schneider, Jessaca Spybrook, and Fatih Unlu) for their invaluable contributions. Conducting these two large-scale, complex, multi-year evaluations over the last five years has not only given us the opportunity to learn much about both programs but has also challenged our thinking, allowing us to grow as evaluators and researchers. We now reflect on some of the key lessons we learned, which we hope will contribute to the field’s efforts to move large-scale evaluations forward.
Background on Both Programs and Study Summaries
Making Sense of SCIENCE (developed by WestEd) is a teacher professional learning model aimed at increasing student achievement through improving instruction and supporting districts, schools, and teachers in their implementation of the Next Generation Science Standards (NGSS). The key components of the model include building leadership capacity and providing teacher professional learning. The program’s theory of action is based on the premise that professional learning that is situated in an environment of collaborative inquiry and supported by school and district leadership produces a cascade of effects on teachers’ content and pedagogical content knowledge, teachers’ attitudes and beliefs, the school climate, and students’ opportunities to learn. These effects, in turn, yield improvements in student achievement and in non-academic outcomes (e.g., enjoyment of science, self-efficacy, and agency in science learning). NGSS had been introduced just two years before the study, which ran from 2015 through 2018. The infancy of NGSS and the resulting shifts in the landscape of science education posed a significant challenge to our study, which we discuss below.
Our impact study of Making Sense of SCIENCE was a cluster-randomized, two-year evaluation involving more than 300 teachers and 8,000 students. Confirmatory impact analyses found a positive and statistically significant impact on teacher content knowledge. While impact results on student achievement were mostly positive, none reached statistical significance. Exploratory analyses found positive impacts on teacher self-reports of time spent on science instruction, shifts in instructional practices, and amount of peer collaboration. Read our final report here.
CREATE is a three-year teacher residency program for students of Georgia State University College of Education and Human Development (GSU CEHD) that begins in their last year at GSU and continues through their first two years of teaching. The program seeks to raise student achievement by increasing the effectiveness and retention of both new and veteran educators. It does so by developing critically conscious, compassionate, and skilled educators who are committed to teaching practices that prioritize racial justice and interrupt inequities.
Our impact study of CREATE used a quasi-experimental design to evaluate program effects for two staggered cohorts of study participants (CREATE and comparison early career teachers) from their final year at GSU CEHD through their second year of teaching, starting with the first cohort in 2015–16. Confirmatory impact analyses found no impact on teacher performance or on student achievement. However, exploratory analyses revealed a positive and statistically significant impact on continuous retention over a three-year time period (spanning graduation from GSU CEHD, entering teaching, and retention into the second year of teaching) for the CREATE group, compared to the comparison group. We also observed that higher continuous retention among Black educators in CREATE, relative to those in the comparison group, is the main driver of the favorable impact. The fact that the differential impacts on Black educators were positive and statistically significant for measures of executive functioning (resilience) and self-efficacy—and marginally statistically significant for stress management related to teaching—hints at potential mediators of impact on retention and guides future research.
After the i3 program funded this research, Empirical Education, GSU CEHD, and CREATE received two additional grants from the U.S. Department of Education’s Supporting Educator Effectiveness Development (SEED) program for further study of CREATE. We are currently studying our sixth cohort of CREATE residents and will have studied eight cohorts of CREATE residents, five cohorts of experienced educators, and two cohorts of cooperating teachers by the end of the second SEED grant. We are excited to continue our work with the GSU and CREATE teams and to explore the impact of CREATE, especially for retention of Black educators. Read our final report for the i3 evaluation of CREATE here.
While there were many lessons learned over the past five years, we’ll highlight two that were particularly challenging and possibly most pertinent to other evaluators.
The first key challenge that both studies faced was the availability of valid and reliable instruments to measure impact. For Making Sense of SCIENCE, a measure of student science achievement aligned with NGSS was difficult to identify because the standards were relatively new and emphasized three-dimensional learning (disciplinary core ideas, science and engineering practices, and cross-cutting concepts). This multi-dimensional learning stood in stark contrast to the prevailing view of science education at the time, which focused primarily on science content. In 2014, one year prior to the start of our study, the National Research Council pointed out that “the assessments that are now in wide use were not designed to meet this vision of science proficiency and cannot readily be retrofitted to do so” (NRC, 2014, p. 12). While the state science assessments that existed at the time were valid and reliable, they focused on science content and did not measure the type of three-dimensional learning targeted by NGSS. The NRC also noted that developing new assessments “present[s] complex conceptual, technical, and practical challenges, including cost and efficiency, obtaining reliable results from new assessment types, and developing complex tasks that are equitable for students across a wide range of demographic characteristics” (NRC, 2014, p. 16).
Given this context, and despite an extensive search for assessments from a variety of sources (including state departments of education, university-affiliated assessment centers, and test developers), the research team could not find an appropriate instrument. Using state assessments was not an option because the states in our study were still piloting or field testing assessments aligned to NGSS or to state standards based on NGSS. This absence of assessments left the evaluation team with no choice but to develop one, independently of the program developer, using established items from multiple sources to address general specifications of NGSS and relying on the deep content expertise of some members of the research team. This approach carried risks, especially given the lack of opportunity to comprehensively pilot or field test the items in the context of the study. When used operationally, the researcher-developed assessment turned out to be difficult and was not highly discriminating of ability at the low end of the achievement scale, which may have influenced the small effect size we observed. The circumstances around the assessment and the need to improvise a measure lead us to interpret the findings on science achievement for the Making Sense of SCIENCE program with caution.
The CREATE evaluation also faced a measurement challenge. One of the two confirmatory outcomes in the study was teacher performance, as measured by school administrators’ ratings of teachers on two of the state’s Teacher Assessment on Performance Standards (TAPS), a component of the state’s evaluation system (Georgia Department of Education, 2021). We could not detect impact on this measure because the variance observed in the ordinal ratings was remarkably low, with ratings overwhelmingly centered on the median value. This was not a complete surprise, because the literature documents this lack of variability in teaching performance ratings. A seminal report, The Widget Effect by The New Teacher Project (Weisberg et al., 2009), called attention to this “national crisis”: the inability of schools to effectively differentiate among low- and high-performing teachers. The report showed that in districts that use binary evaluation ratings, as well as in those that use a broader range of rating options, less than 1% of teachers received a rating of unsatisfactory. The same pattern appeared in the CREATE study. In a study of teacher performance ratings, Kraft and Gilmour (2017) reported that principals were more reluctant to give new teachers a rating below proficient because they acknowledged that new teachers were still working to improve their teaching, and that “giving a low rating to a potentially good teacher could be counterproductive to a teacher’s development.” These reasons are particularly relevant to the CREATE study given that the teachers in our study were very early in their teaching careers (first-year teachers), and given the high turnover rate of all teachers in Georgia.
We bring up this point about instruments to share with the evaluation community what we see as a common challenge. In 2018 (the final year of outcomes data collection for Making Sense of SCIENCE), when we presented at AERA on the difficulties of finding a valid and reliable NGSS-aligned instrument, a handful of researchers approached us to commiserate; they too were struggling to find an established NGSS-aligned instrument. As we write this, perhaps states and testing centers are further along in their development of NGSS-aligned assessments. However, the challenge of finding valid and reliable instruments, generally speaking, will persist as long as educational standards continue to evolve. (And they will.) Our response to this challenge was to be as transparent as possible about the instruments and the conclusions we can draw from using them. In reporting on Making Sense of SCIENCE, we provided detailed descriptions of our process for developing the instruments and reported item- and form-level statistics, as well as contextual information and the rationale for critical decisions. In reporting on CREATE, we provided the distribution of ratings on the relevant dimensions of teacher performance for both the baseline and outcome measures. By being transparent, we allow readers to draw their own conclusions from the data available, facilitate the review of the quality of the evidence against various sets of research standards, support replication of the study, and provide further context for future study.
A second challenge was maintaining a consistent sample over the course of the implementation, particularly in multi-year studies. For Making Sense of SCIENCE, which was conducted over two years, there was substantial teacher mobility into and out of the study. Given the reality of schools, even with study incentives, nearly half of teachers moved out of study schools or study-eligible grades within schools over the two-year period of the study. This presented a challenge to program implementation. WestEd delivered professional learning as intended, and leadership professional learning activities all met fidelity thresholds for attendance, with strong uptake of Making Sense of SCIENCE within each year (over 90% of teachers met fidelity thresholds). Yet only slightly more than half of study teachers met the fidelity threshold for both years. The percentage of teachers leaving their school was consistent with national figures: only 84% of teachers remain teaching at the same school from one year to the next (McFarland et al., 2019). For assessing impacts, the effects of teacher mobility can be addressed to some extent at the analysis stage; however, the more important goal is to find ways to achieve fidelity of implementation and exposure for the full program duration. One option is to increase incentives and build more buy-in, including among administrators, so that more teachers reach the two-year participation targets because they are retained in the subjects and grades that preserve their eligibility for the study. This solution may go only part of the way because teacher mobility is a reality. Another option is to adapt the program to make it shorter and more intensive. However, this option may work against the core model of the program’s implementation, which may require time for teachers to assimilate their learning.
Yet another option is to make the program more adaptable, for example, by letting teachers who leave eligible grades or schools continue to participate remotely, which would allow impacts to be assessed over more of the initially randomized sample.
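As a rough check on how year-over-year mobility compounds across a two-year study, the sketch below raises the national stay rate cited above (84%) to the power of the number of years. It assumes the rate is constant and independent from year to year, which is a simplification, and the function name `share_remaining` is hypothetical rather than anything used in the study.

```python
def share_remaining(annual_stay_rate: float, years: int) -> float:
    """Share of teachers still at the same school after `years`.

    Assumes a constant stay rate that applies independently each year,
    which is a simplifying assumption, not a finding of the study.
    """
    if not 0.0 <= annual_stay_rate <= 1.0:
        raise ValueError("annual_stay_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return annual_stay_rate ** years


# National year-over-year stay rate cited in the text (McFarland et al., 2019).
two_year_share = share_remaining(0.84, 2)
print(f"Expected share remaining after two years: {two_year_share:.1%}")
print(f"Expected share leaving within two years: {1 - two_year_share:.1%}")
```

Under these assumptions, about 29% of teachers would be expected to leave their school within two years. The study's own figure is not directly comparable, because it also counts teachers who moved to ineligible grades within the same school.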
For CREATE, sample size was also a challenge, but for slightly different reasons. During study design and recruitment, we had anticipated and factored the estimated level of attrition into the power analysis, and we successfully recruited the targeted number of teachers. However, several unexpected limitations arose during the study that ultimately resulted in small analytic samples. These limitations included challenges in obtaining research permission from districts and schools (which would have allowed participants to remain active in the study), as well as a loss of study participants due to life changes (e.g., obtaining teaching positions in other states, leaving the teaching profession completely, or feeling like they no longer had the time to complete data collection activities). Also, while Georgia administers the Milestones state assessment in grades 4–8, many participating teachers in both conditions taught lower elementary school grades or non-tested subjects. For the analysis phase, many factors resulted in small student samples: reduced teacher samples, the technical requirement of matching students across conditions within each cohort in order to meet WWC evidence standards, and the need to match students within grades, given the lack of vertically scaled scores. While we did achieve baseline equivalence between the CREATE and comparison groups for the analytic samples, the small number of cases greatly reduced the scope and external validity of the conclusions related to student achievement. The most robust samples were for retention outcomes. We have the most confidence in those results.
As a last point of reflection, we greatly enjoyed and benefited from the close collaboration with our partners on these projects. The research and program teams worked together in lockstep at many stages of the study. We also want to acknowledge the role that the i3 grant played in promoting the collaboration. For example, the grant’s requirements around the development and refinement of the logic model were a major driver of many collaborative efforts. Evaluators reminded the team periodically about the “accountability” requirements, such as ensuring consistency in the definition and use of the program components and mediators in the logic model. The program team, on the other hand, contributed contextual knowledge gained through decades of being intimately involved in the program. In the spirit of participatory evaluation, the two teams benefited from the type of organizational learning that “occurs when cognitive systems and memories are developed and shared by members of the organizations” (Cousins & Earl, 1992). This type of organic and fluid relationship encouraged the research and program teams to embrace uncertainty during the study. While we “pre-registered” confirmatory research questions for both studies by submitting the study plans to NEi3 prior to the start of the studies, we allowed exploratory questions to be guided by conversations with the program developers. In doing so, we were able to address the questions that were most useful to the program developers and to the districts and schools implementing the programs.
We are thankful that we had the opportunity to conduct these two rigorous evaluations alongside such humble, thoughtful, and intentional (among other things!) program teams over the last five years, and we look forward to future collaborations. These two evaluations have both broadened and deepened our experience with large-scale evaluations, and we hope that our reflections here not only serve as lessons for us, but that they may also be useful to the education evaluation community at large, as we continue our work in the complex and dynamic education landscape.
Cousins, J. B., & Earl, L. M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.
Georgia Department of Education. (2021). Teacher Keys Effectiveness System. https://www.gadoe.org/School-Improvement/Teacher-and-Leader-Effectiveness/Pages/Teacher-Keys-Effectiveness-System.aspx
Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.
McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., & Barmer, A. (2019). The Condition of Education 2019 (NCES 2019-144). U.S. Department of Education. National Center for Education Statistics. https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2019144
National Research Council (NRC). (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. The National Academies Press.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New Teacher Project. https://tntp.org/wp-content/uploads/2023/02/TheWidgetEffect_2nd_ed.pdf