
Towards Greater (Local) Relevance of Causal Generalizations

To cite the paper we discuss in this blog post, use the reference below.

Jaciw, A. P., Unlu, F., & Nguyen, T. (2021). A within-study approach to evaluating the role of moderators of impact in limiting generalizations from “large to small”. American Journal of Evaluation.

Generalizability of Causal Inferences

The field of education has made much progress over the past 20 years in the use of rigorous methods, such as randomized experiments, for evaluating causal impacts of programs. This includes a growing number of studies on the generalizability of causal inferences stemming from the recognition of the prevalence of impact heterogeneity and its sources (Steiner et al., 2019). Most recent work on generalizability of causal inferences has focused on inferences from “small to large”. Studies typically include 30–70 schools while generalizations are made to inference populations at least ten times larger (Tipton et al., 2017). Such studies are typically used in informing decision makers concerned with impacts on broad scales, for example at the state level. However, as we are periodically reminded by the likes of Cronbach (1975, 1982) and Shadish et al. (2002), generalizations are of many types and support decisions on different levels. Causal inferences may be generalized not only to populations outside the study sample or to larger populations, but also to subgroups within the study sample and to smaller groups – even down to the individual! In practice, district and school officials who need local interpretations of the evidence might ask: “If a school reform effort demonstrates positive impact on some large scale, should I, as a principal, expect that the reform will have positive impact on the students in my school?” Our work introduces a new approach (or a new application of an old approach) to address questions of this type. We empirically evaluate how well causal inferences that are drawn on the large scale generalize to smaller scales.

The Research Method

We adapt a method from studies traditionally used (first in economics and then in education) to empirically measure the accuracy of program impact estimates from non-experiments. A central question is whether specific strategies result in better alignment between non-experimental impact findings and experimental benchmarks. Those studies—sometimes referred to as “Within-Study Comparison” studies (pioneered by Lalonde, 1986, and Fraker & Maynard, 1987)—typically start with an estimate of a program’s impact from an uncompromised experiment. This result serves as the benchmark experimental impact finding. Then, to generate a non-experimental result, outcomes from the experimental control group are replaced with those from a different comparison group. The difference in impact that results from this substitution measures the bias (inaccuracy) in the result that employs the non-experimental comparison. Researchers typically summarize this bias and then try to remediate it using various design- and analysis-based strategies. (The Within-Study Comparison literature is vast and includes many studies that we cite in the article.)
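The core of this logic can be sketched in a few lines of Python. All of the group names and effect sizes below are hypothetical, purely to illustrate how the bias is defined; the Within-Study Comparison literature computes this quantity from real experimental and non-experimental data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical outcome data (standardized); values are ours, for illustration.
treated = rng.normal(0.30, 1.0, size=500)     # experimental treatment group
control = rng.normal(0.00, 1.0, size=500)     # randomized control group
comparison = rng.normal(0.10, 1.0, size=500)  # non-experimental comparison group

# Benchmark impact from the uncompromised experiment.
benchmark_impact = treated.mean() - control.mean()

# Non-experimental impact: same treated outcomes, substituted comparison group.
nonexp_impact = treated.mean() - comparison.mean()

# Bias (inaccuracy) attributable to the substitution.
bias = nonexp_impact - benchmark_impact
print(f"bias of the non-experimental estimate: {bias:.3f}")
```

Note that the bias reduces algebraically to the difference between the mean outcomes of the randomized control group and the substituted comparison group, which is why the quality of the comparison group is the whole story.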

Our Approach Follows a Within-Study Comparison Rationale and Method, but with a Focus on Generalizability.

We use data from the multisite Tennessee Student-Teacher Achievement Ratio (STAR) class size reduction experiment (described in Finn & Achilles, 1990; Mosteller, 1995; Nye et al., 2000) to illustrate the application of our method. (We used 73 of the original 79 sites.) In the original study, students and teachers were randomized to small or regular-sized classes in grades K–3. Results showed a positive average impact of small classes. In our study, we ask whether a decision maker at a given site should accept this finding of an overall average positive impact as generalizable to his or her individual site.

We use the Within-Study Comparison Method as a Foundation.

First, we adopt the idea of using experimental benchmark impacts as the starting point. In the case of the STAR trial, each of the 73 individual sites yields its own benchmark value for impact. Second, consistent with Within-Study Comparisons, we select an alternative to compare against the benchmark. Specifically, we choose the average of impacts (the grand mean) across all sites as the generalized value. Third, we establish how closely this generalized value approximates impacts at individual sites (i.e., how well it generalizes “to the small”). With STAR, we can do this 73 times, once for each site. Fourth, we summarize the discrepancies. Standard Within-Study Comparison methods typically average over the absolute values of individual biases. We adapt this, but instead use the average of the 73 squared differences between the generalized impact and the site-benchmark impacts. This allows us to capture the average discrepancy as a variance, specifically as the variation in impact across sites. We estimated this variation several ways, using alternative hierarchical linear models. Finally, we examine whether adjusting for imbalance between sites in site-level characteristics that potentially interact with treatment leads to closer alignment between the grand mean (generalized) and site-specific impacts. (Readers sometimes wonder why, if site-specific benchmark impacts are available, one would use less-optimal comparison-group-based alternatives. With Within-Study Comparisons, the whole point is to see how closely we can replicate the benchmark quantity, in order to learn how well methods of causal inference (of generalization, in this case) are likely to perform in situations where we do not have an experimental benchmark.)
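A minimal numerical sketch of these four steps follows. The site impacts below are simulated stand-ins (the actual values come from the 73 STAR sites), and the paper estimates the variance with hierarchical linear models rather than this raw calculation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: hypothetical stand-in for the 73 site-level experimental
# benchmark impacts; in the actual study these come from the STAR data.
site_impacts = rng.normal(loc=0.25, scale=0.30, size=73)

# Step 2: the generalized value is the grand mean across sites.
grand_mean = site_impacts.mean()

# Step 3: discrepancy between the generalized value and each site's
# benchmark impact.
discrepancies = grand_mean - site_impacts

# Step 4: summarize as the average of squared differences -- i.e., the
# between-site variance in impact -- rather than mean absolute bias.
variance_of_impacts = np.mean(discrepancies ** 2)
rmse = np.sqrt(variance_of_impacts)  # same units as the outcome (SDs)

print(f"grand mean impact: {grand_mean:.2f} SD")
print(f"typical error when generalizing it to a site: {rmse:.2f} SD")
```

Because the generalized value is the grand mean itself, the average squared discrepancy is exactly the between-site variance of the impacts, which is what makes impact variation a direct index of generalization accuracy.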

Our application is intentionally based on Within-Study Comparison methods, as set out in Jaciw (2010, 2016). Early applications with a similar approach can be found in Hotz et al. (2005) and Hotz et al. (2006). A new contribution of ours is that we summarize the discrepancy not as an average of the absolute values of bias (a common metric in Within-Study Comparison studies) but, as noted above, as a variance. This may sound like a nuanced technical detail, but we think it leads to an important interpretation: variation in impact is not just related to the problem of generalizability; rather, it directly indexes the accuracy (quantifies the degree of validity) of generalizations from “large to small”. We acknowledge Bloom et al. (2005) for the impetus for this idea, specifically their insight that bias in Within-Study Comparison studies can be thought of as a type of “mismatch error”. Finally, we think it is important to acknowledge the ideas in G Theory from education (Cronbach et al., 1963; Shavelson & Webb, 2009). In that tradition, parsing variability in outcomes, accounting for its sources, and assessing the role of interactions among study factors are central to the problem of generalizability.

Research Findings

First main result

The grand mean impact, on average, does not generalize reliably to the 73 sites. Before covariate adjustments, the average difference between the grand mean and the impacts at individual sites ranges from 0.25 to 0.41 standard deviations (SDs) of the outcome distribution, depending on the model used. After covariate adjustments, it ranges from 0.17 to 0.41 SDs. (The average impact was about 0.25 SD.)

Second main result

Modeling effects of site-level covariates, and their interactions with treatment, only minimally reduced the between-site differences in impact.

Third main result

Whether impact heterogeneity achieves statistical significance depends on sampling error and on correctly accounting for its sources. If we are going to provide accurate policy advice, we must make sure that we are not mistaking random sampling error within sites (differences we would expect in results even if the program were not used) for variation in impact across sites. One important but easily overlooked source of random sampling error comes from classes. Given that teachers provide different value-added to students’ learning, we can expect differences in outcomes across classes. In STAR, with only a handful of teachers per school, the between-class differences easily add noise to the between-school outcomes and impacts. After adjusting for class random effects, the discrepancies in impact described above decreased by approximately 40%.
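A simplified simulation illustrates why this matters. All variance components below are assumed values of ours, not estimates from STAR, and the paper itself uses hierarchical linear models with class random effects rather than this method-of-moments shortcut. With only a few classes per school, class-level noise inflates the naive between-site variance in impact:

```python
import numpy as np

rng = np.random.default_rng(1)

n_sites, classes_per_site = 73, 4  # a handful of classes per school, as in STAR
true_site_sd = 0.15                # assumed true between-site SD of impact
class_sd = 0.20                    # assumed class-level (teacher) noise, in SDs

# True site impacts, plus class-level sampling error averaged over classes.
true_impacts = rng.normal(0.25, true_site_sd, size=n_sites)
class_noise = rng.normal(0.0, class_sd,
                         size=(n_sites, classes_per_site)).mean(axis=1)
observed_impacts = true_impacts + class_noise

# Naive between-site variance mixes true heterogeneity with class-level noise.
naive_var = np.var(observed_impacts)

# Adjustment: subtract the variance contributed by class-level sampling
# error, which for an average of independent class effects is
# class_sd**2 / classes_per_site.
adjusted_var = naive_var - class_sd**2 / classes_per_site

print(f"naive between-site SD of impact:    {np.sqrt(naive_var):.2f}")
print(f"adjusted between-site SD of impact: {np.sqrt(max(adjusted_var, 0.0)):.2f}")
```

The adjusted variance is always smaller than the naive one, mirroring the roughly 40% reduction in measured discrepancies we observed once class random effects were modeled.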

Research Conclusions

For the STAR experiment, the grand mean impact failed to generalize to individual sites. Adjusting for effects of moderators did not help much. Adjusting for class-level sampling error substantially reduced the measured heterogeneity; even so, the remaining discrepancies were large enough to be substantively important, and we therefore cannot conclude that the average impact generalized to individual sites.

In sum, based on this study, a policymaker at the site (school) level should exercise caution in assessing whether the average result applies to his or her unique context.

The results remind us of an observation from Lee Cronbach (1982) about how a school board serving a large Hispanic student body might best draw inferences about its local context when program effects vary:

The school board might therefore do better to look at…small cities, cities with a large Hispanic minority, cities with well-trained teachers, and so on. Several interpretations-by-analogy can then be made….If these several conclusions are not too discordant, the board can have some confidence in the decision that it makes about its small city with well-trained teachers and a Hispanic clientele. When results in the various slices of data are dissimilar, it is better to try to understand the variation than to take the well-determined – but only remotely relevant – national average as the best available information. The school board cannot regard that average as superior information unless it believes that district characteristics do not matter (p. 167).

Some Possible Extensions of The Work

We look forward to further work on how to produce useful generalizations that support decision-making on smaller scales. Traditional Within-Study Comparison studies give us much food for thought, including about other designs and analysis strategies for inferring impacts at individual sites, and about how best to communicate the discrepancies we observe and whether they are substantively large enough to matter for informing policy decisions and outcomes. One area of main interest concerns the quality of the moderators themselves; that is, how well they account for or explain impact heterogeneity. Here our approach diverges from traditional Within-Study Comparison studies. When applied to problems of internal validity, confounders can be seen as nuisances that make our impact estimates inaccurate. With regard to external validity, factors that interact with the treatment, and thereby produce variation in impact that affects generalizability, are not a nuisance; rather, they are an important source of information that may help us to understand the mechanisms through which the variation in impact occurs. Therefore, understanding the mechanisms relating the person, the program, the context, and the outcome is key.

Lee Cronbach described the bounty of and interrelations among interactions in the social sciences as a “hall of mirrors”. We’re looking forward to continuing the careful journey along that hall to incrementally make sense of a complex world!


Bloom, H. S., Michalopoulos, C., & Hill, C. J. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In H. S. Bloom (Ed.), Learning more from social experiments (pp. 173–235). Russell Sage Foundation.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30(2), 116–127.

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberation of reliability theory. The British Journal of Statistical Psychology, 16, 137–163.

Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. Jossey-Bass.

Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557–577.

Fraker, T., & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources, 22, 194–227.

Jaciw, A. P. (2010). Challenges to drawing generalized causal inferences in educational research: Methodological and philosophical considerations. [Doctoral dissertation, Stanford University.]

Jaciw, A. P. (2016). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40, 199–240.

Hotz, V. J., Imbens, G. W., & Klerman, J. A. (2006). Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN Program. Journal of Labor Economics, 24, 521–566.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125, 241–270.

Lalonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76, 604–620.

Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5, 113–127.

Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123–151.


Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Houghton Mifflin.

Shavelson, R. J., & Webb, N. M. (2009). Generalizability theory and its contributions to the discussion of the generalizability of research findings. In K. Ercikan & W. M. Roth (Eds.), Generalizing from educational research (pp. 13–32). Routledge.

Steiner, P. M., Wong, V. C., & Anglin, K. (2019). A causal replication framework for designing and assessing replication efforts. Zeitschrift für Psychologie, 227, 280–292.

Tipton, E., Hallberg, K., Hedges, L. V., & Chan, W. (2017). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5), 472–505.

Jaciw, A. P., Unlu, F., & Nguyen, T. (2021). A within-study approach to evaluating the role of moderators of impact in limiting generalizations from “large to small”. American Journal of Evaluation.



Introducing SEERNet with the Goal of Replication Research

In 2021, we partnered with Digital Promise on a research proposal for the IES research network: Digital Learning Platforms to Enable Efficient Education Research Network. The project, SEER Research Network for Digital Learning Platforms (SEERNet), was funded through an IES education research grant in fall 2021, and we took off running. Digital Promise launched this SEERNet website to keep the community up to date on our progress. We’ve been meeting with five platform hosts, selected by IES, to develop ideas for replication research, generalizability research, and rapid research.

The goal of SEERNet is to integrate rigorous education research into existing digital learning platforms (DLPs) in an effort to modernize research. The digital learning platforms have the potential to support education researchers as they study new ideas and seek to replicate those ideas quickly, across many sites, with a wide range of student populations and with a variety of education research topics. Each of the five platforms (listed below) will eventually have over 100,000 users, allowing us to explore ways to increase the efficiency of a replication study.

  1. Kinetic by OpenStax
  2. UpGrade/MATHia by Carnegie Learning
  3. Learning at Scale by Arizona State University
  4. E-Trials by ASSISTments
  5. Terracotta by Canvas

As the network leads, Empirical Education and Digital Promise will work to share best practices among the DLPs and build a community of researchers and practitioners interested in the opportunities afforded by these innovative platforms for impactful research. Stay tuned for more updates on how you can get involved!

This project is supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305N210034 to Digital Promise. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.


Introducing Our Newest Researchers

The Empirical Research Team is pleased to announce the addition of two new team members. We welcome Zahava Heydel and Chelsey Nardi on board as our newest researchers!

Zahava Heydel, Research Assistant

Zahava has taken on assisting Sze-Shun Lau with the CREATE project, a teacher residency program in Atlanta Public Schools invested in expanding equity in education by developing critically conscious, compassionate, and skilled educators. Zahava’s experience as a research assistant at the University of Colorado Anschutz Medical Campus Department of Psychiatry, Colorado Center for Women’s Behavioral Health and Wellness is an asset to the Empirical Education team as we move toward evaluating SEL programs and individual student needs.

Chelsey Nardi, Research Manager

Chelsey is taking on the role of co-project manager for our evaluation of the CREATE project, working with Sze-Shun and Zahava. Chelsey is currently working toward her PhD exploring the application of antiracist theories in science education, which may support the evaluation of CREATE’s mission to develop critically conscious educators. Additionally, her research experience at McREL International and REL Pacific as a Research and Evaluation Associate has prepared her for managing some of our REL Southwest applied research projects. These experiences, coupled with her experience in project management, make her an ideal fit for our team.


Empirical Education Wraps Up Two Major i3 Research Studies

Empirical Education is excited to share that we recently completed two Investing in Innovation (i3) (now EIR) evaluations, of the Making Sense of SCIENCE and Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) programs. We thank the staff on both programs for their fantastic partnership. We also acknowledge Anne Wolf, our i3 technical assistance liaison from Abt Associates, as well as our Technical Working Group members on the Making Sense of SCIENCE project (Anne Chamberlain, Angela DeBarger, Heather Hill, Ellen Kisker, James Pellegrino, Rich Shavelson, Guillermo Solano-Flores, Steve Schneider, Jessaca Spybrook, and Fatih Unlu) for their invaluable contributions. Conducting these two large-scale, complex, multi-year evaluations over the last five years has not only given us the opportunity to learn much about both programs, but has also challenged our thinking—allowing us to grow as evaluators and researchers. We now reflect on some of the key lessons we learned, lessons that we hope will contribute to the field’s efforts in moving large-scale evaluations forward.

Background on Both Programs and Study Summaries

Making Sense of SCIENCE (developed by WestEd) is a teacher professional learning model aimed at increasing student achievement through improving instruction and supporting districts, schools, and teachers in their implementation of the Next Generation Science Standards (NGSS). The key components of the model include building leadership capacity and providing teacher professional learning. The program’s theory of action is based on the premise that professional learning that is situated in an environment of collaborative inquiry and supported by school and district leadership produces a cascade of effects on teachers’ content and pedagogical content knowledge, teachers’ attitudes and beliefs, the school climate, and students’ opportunities to learn. These effects, in turn, yield improvements in student achievement and other non-academic outcomes (e.g., enjoyment of science, self-efficacy, and agency in science learning). NGSS had been introduced just two years before our study, which ran from 2015 through 2018. The infancy of NGSS and the resulting shifting landscape of science education posed a significant challenge to our study, which we discuss below.

Our impact study of Making Sense of SCIENCE was a cluster-randomized, two-year evaluation involving more than 300 teachers and 8,000 students. Confirmatory impact analyses found a positive and statistically significant impact on teacher content knowledge. While impact results on student achievement were mostly positive, none reached statistical significance. Exploratory analyses found positive impacts on teacher self-reports of time spent on science instruction, shifts in instructional practices, and amount of peer collaboration. Read our final report here.

CREATE is a three-year teacher residency program for students of Georgia State University College of Education and Human Development (GSU CEHD) that begins in their last year at GSU and continues through their first two years of teaching. The program seeks to raise student achievement by increasing teacher effectiveness and retention of both new and veteran educators by developing critically-conscious, compassionate, and skilled educators who are committed to teaching practices that prioritize racial justice and interrupt inequities.

Our impact study of CREATE used a quasi-experimental design to evaluate program effects for two staggered cohorts of study participants (CREATE and comparison early career teachers) from their final year at GSU CEHD through their second year of teaching, starting with the first cohort in 2015–16. Confirmatory impact analyses found no impact on teacher performance or on student achievement. However, exploratory analyses revealed a positive and statistically significant impact on continuous retention over a three-year time period (spanning graduation from GSU CEHD, entering teaching, and retention into the second year of teaching) for the CREATE group, compared to the comparison group. We also observed that higher continuous retention among Black educators in CREATE, relative to those in the comparison group, was the main driver of the favorable impact. The fact that the differential impacts on Black educators were positive and statistically significant for measures of executive functioning (resilience) and self-efficacy—and marginally statistically significant for stress management related to teaching—hints at potential mediators of impact on retention and guides future research.

After the i3 program funded this research, Empirical Education, GSU CEHD, and CREATE received two additional grants from the U.S. Department of Education’s Supporting Educator Effectiveness Development (SEED) program for further study of CREATE. We are currently studying our sixth cohort of CREATE residents and will have studied eight cohorts of CREATE residents, five cohorts of experienced educators, and two cohorts of cooperating teachers by the end of the second SEED grant. We are excited to continue our work with the GSU and CREATE teams and to explore the impact of CREATE, especially for retention of Black educators. Read our final report for the i3 evaluation of CREATE here.

Lessons Learned

While there were many lessons learned over the past five years, we’ll highlight two that were particularly challenging and possibly most pertinent to other evaluators.

The first key challenge that both studies faced was the availability of valid and reliable instruments to measure impact. For Making Sense of SCIENCE, a measure of student science achievement that was aligned with NGSS was difficult to identify because of the relative newness of the standards, which emphasized three-dimensional learning (disciplinary core ideas, science and engineering practices, and cross-cutting concepts). This multi-dimensional learning stood in stark contrast to the existing view of science education at the time, which primarily focused on science content. In 2014, one year prior to the start of our study, the National Research Council pointed out that “the assessments that are now in wide use were not designed to meet this vision of science proficiency and cannot readily be retrofitted to do so” (NRC, 2014, p. 12). While state science assessments that existed at the time were valid and reliable, they focused on science content and did not measure the type of three-dimensional learning targeted by NGSS. The NRC also noted that developing new assessments “present[s] complex conceptual, technical, and practical challenges, including cost and efficiency, obtaining reliable results from new assessment types, and developing complex tasks that are equitable for students across a wide range of demographic characteristics” (NRC, 2014, p. 16).

Given this context, despite the research team’s extensive search for assessments from a variety of sources—including reaching out to state departments of education, university-affiliated assessment centers, and test developers—we could not find an appropriate instrument. Using state assessments was not an option. The states in our study were still in the process of either piloting or field testing assessments that were aligned to NGSS or to state standards based on NGSS. This void of assessments left the evaluation team with no choice but to develop one, independently of the program developer, using established items from multiple sources to address general specifications of NGSS, and relying on the deep content expertise of some members of the research team. Of course, there were risks associated with this approach, especially given the lack of opportunity to comprehensively pilot or field test the items in the context of the study. When used operationally, the researcher-developed assessment turned out to be difficult and was not highly discriminating of ability at the low end of the achievement scale, which may have influenced the small effect size we observed. The circumstances around the assessment and the need to improvise a measure lead us to interpret findings related to science achievement of the Making Sense of SCIENCE program with caution.

The CREATE evaluation also faced a measurement challenge. One of the two confirmatory outcomes in the study was teacher performance, as measured by ratings of teachers by school administrators on two of the state’s Teacher Assessment on Performance Standards (TAPS), a component of the state’s evaluation system (Georgia Department of Education, 2021). We could not detect impact on this measure because the variance observed in the ordinal ratings was remarkably low, with ratings overwhelmingly centered on the median value. This was not a complete surprise. The literature documents this lack of variability in teaching performance ratings. A seminal report, The Widget Effect by The New Teacher Project (Weisberg et al., 2009), called attention to this “national crisis”—the inability of schools to effectively differentiate among low- and high-performing teachers. The report showed that in districts that use binary evaluation ratings, as well as those that use a broader range of rating options, less than 1% of teachers received a rating of unsatisfactory. In a study of teacher performance ratings by Kraft and Gilmour (2017), principals explained that they were reluctant to give new teachers a rating below proficient, acknowledging that new teachers were still working to improve their teaching and that “giving a low rating to a potentially good teacher could be counterproductive to a teacher’s development.” These reasons are particularly relevant to the CREATE study, given that the teachers in our study were very early in their teaching careers (first-year teachers) and given the high turnover rate of all teachers in Georgia.

We bring up this point about instruments as a way to share with the evaluation community what we see as a not uncommon challenge. In 2018 (the final year of outcomes data collection for Making Sense of SCIENCE), when we presented about the difficulties of finding a valid and reliable NGSS-aligned instrument at AERA, a handful of researchers approached us to commiserate; they too were experiencing similar challenges with finding an established NGSS-aligned instrument. As we write this, perhaps states and testing centers are further along in their development of NGSS-aligned assessments. However, the challenge of finding valid and reliable instruments, generally speaking, will persist as long as educational standards continue to evolve. (And they will.) Our response to this challenge was to be as transparent as possible about the instruments and the conclusions we can draw from using them. In reporting on Making Sense of SCIENCE, we provided detailed descriptions of our process for developing the instruments and reported item- and form-level statistics, as well as contextual information and rationale for critical decisions. In reporting on CREATE, we provided the distribution of ratings on the relevant dimensions of teacher performance for both the baseline and outcome measures. In being transparent, we allow the readers to draw their own conclusions from the data available, facilitate the review of the quality of the evidence against various sets of research standards, support replication of the study, and provide further context for future study.

A second challenge was maintaining a consistent sample over the course of the implementation, particularly in multi-year studies. For Making Sense of SCIENCE, which was conducted over two years, there was substantial teacher mobility into and out of the study. Given the reality of schools, even with study incentives, nearly half of teachers moved out of study schools or study-eligible grades within schools over the two year period of the study. This obviously presented a challenge to program implementation. WestEd delivered professional learning as intended, and leadership professional learning activities all met fidelity thresholds for attendance, with strong uptake of Making Sense of SCIENCE within each year (over 90% of teachers met fidelity thresholds). Yet, only slightly more than half of study teachers met the fidelity threshold for both years. The percentage of teachers leaving the school was congruous with what we observed at the national level: only 84% of teachers stay as a teacher at the same school year-over-year (McFarland et al., 2019). For assessing impacts, the effects of teacher mobility can be addressed to some extent at the analysis stage; however, the more important goal is to figure out ways to achieve fidelity of implementation and exposure for the full program duration. One option is to increase incentivization and try to get more buy-in, including among administration, to allow more teachers to reach the two-year participation targets by retaining teachers in subjects and grades to preserve their eligibility status in the study. This solution may go part way because teacher mobility is a reality. Another option is to adapt the program to make it shorter and more intensive. However, this option may work against the core model of the program’s implementation, which may require time for teachers to assimilate their learning. 
Yet another option is to make the program more adaptable; for example, by letting teachers who leave eligible grades and schools continue to participate remotely, allowing impacts to be assessed over more of the initially randomized sample.

For CREATE, sample size was also a challenge, but for slightly different reasons. During study design and recruitment, we had anticipated and factored the estimated level of attrition into the power analysis, and we successfully recruited the targeted number of teachers. However, several unexpected limitations arose during the study that ultimately resulted in small analytic samples. These limitations included challenges in obtaining research permission from districts and schools (which would have allowed participants to remain active in the study), as well as a loss of study participants due to life changes (e.g., obtaining teaching positions in other states, leaving the teaching profession completely, or feeling like they no longer had the time to complete data collection activities). Also, while Georgia administers the Milestones state assessment in grades 4–8, many participating teachers in both conditions taught lower elementary school grades or non-tested subjects. For the analysis phase, many factors resulted in small student samples: reduced teacher samples, the technical requirement of matching students across conditions within each cohort in order to meet WWC evidence standards, and the need to match students within grades, given the lack of vertically scaled scores. While we did achieve baseline equivalence between the CREATE and comparison groups for the analytic samples, the small number of cases greatly reduced the scope and external validity of the conclusions related to student achievement. The most robust samples were for retention outcomes. We have the most confidence in those results.

As a last point of reflection, we greatly enjoyed and benefited from the close collaboration with our partners on these projects. The research and program teams worked in lockstep at many stages of the studies. We also want to acknowledge the role that the i3 grant played in promoting this collaboration. For example, the grant's requirements around the development and refinement of the logic model were a major driver of many collaborative efforts. Evaluators reminded the team periodically about the "accountability" requirements, such as ensuring consistency in the definition and use of the program components and mediators in the logic model. The program team, on the other hand, contributed contextual knowledge gained through decades of being intimately involved in the program. In the spirit of participatory evaluation, the two teams benefited from the type of organizational learning that "occurs when cognitive systems and memories are developed and shared by members of the organizations" (Cousins & Earl, 1992). This type of organic and fluid relationship encouraged the research and program teams to embrace uncertainty during the study. While we "pre-registered" confirmatory research questions for both studies by submitting the study plans to NEi3 prior to the start of the studies, we allowed exploratory questions to be guided by conversations with the program developers. In doing so, we were able to address the questions that were most useful to the program developers and to the districts and schools implementing the programs.

We are thankful that we had the opportunity to conduct these two rigorous evaluations alongside such humble, thoughtful, and intentional (among other things!) program teams over the last five years, and we look forward to future collaborations. These two evaluations have both broadened and deepened our experience with large-scale evaluations, and we hope that our reflections here not only serve as lessons for us, but that they may also be useful to the education evaluation community at large, as we continue our work in the complex and dynamic education landscape.


Cousins, J. B., & Earl, L. M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.

Georgia Department of Education (2021). Teacher Keys Effectiveness System.

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.

McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., & Barmer, A. (2019). The Condition of Education 2019 (NCES 2019-144). U.S. Department of Education, National Center for Education Statistics.

National Research Council (NRC). (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. The National Academies Press.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New Teacher Project.


Instructional Coaching: Positive Impacts on Edtech Use and Student Learning

In 2019, Digital Promise contracted with Empirical Education to evaluate the impact of the Dynamic Learning Project (DLP) on teacher and student edtech usage and on student achievement. DLP provided school-based instructional technology coaches with mentoring and professional development, with the goal of increasing educational equity and impactful use of technology. You may have seen the blog post we published in summer 2020 announcing the release of our design memo for the study. The importance of this project was magnified during the pandemic-induced shift to increased use of online tools.

The results of the study are summarized in this research brief published last month. We found evidence of positive impacts on edtech use and student learning across three districts involved in DLP.  

These findings contribute to the evidence base for how to drive meaningful technology use in schools. This should continue to be an area of investigation for future studies; districts focused on equity and inclusion must ensure that edtech is adopted broadly across teacher and student populations.


We Won Two SEED Grants in 2020

Empirical Education began conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) in 2015 under a subcontract with Atlanta Neighborhood Charter Schools (ANCS) as part of their Investing in Innovation (i3) Development grant. Then, in 2018, we extended this work with CREATE and Georgia State University through the Supporting Effective Educator Development (SEED) Grant Program. And now, in 2020, we were notified that BOTH proposals we submitted to the SEED competition to further extend our work with CREATE were awarded grants!

One of the SEED grants is an extension to the one we received in 2018 that will allow us to continue the project for two additional years (through years 4 and 5).

The other SEED award will fund new work with CREATE and Georgia State University by adding additional cohorts of CREATE residents and conducting a quasi-experiment measuring the effectiveness of CREATE for Cooperating Teachers (that is, the mentor teachers in whose classrooms residents are placed). The study will examine impacts on teacher effectiveness, teacher retention, and student achievement, as well as other mediating outcomes.


SREE 2020 Goes Virtual

We, like many of you, were excited to travel to Washington DC in March 2020 to present at the annual conference of the Society for Research on Educational Effectiveness (SREE). This would have been our 15th year attending or presenting at the SREE conference! We had been looking forward to learning from a variety of sessions and to sharing our own work with the SREE community, so imagine our disappointment when the conference was cancelled (rightfully) in response to the pandemic. Thankfully, SREE offered presenters the option to share their work virtually, and we are excited to have taken part in this opportunity!

Among the several accepted conference proposals, we decided to host the symposium on Social and Emotional Learning in Educational Settings & Academic Learning because it incorporated several of our major projects—three evaluations funded by the Department of Education’s i3/EIR program—two of which focus on teacher professional development and one that focuses on content enhancement routines and student content knowledge. We were joined by Katie Lass who presented on another i3/EIR evaluation conducted by the Policy & Research Group and by Anne Wolf, from Abt Associates, who served as the discussant. The presentations focused on unpacking the logic model for each of the respective programs and collectively, we tried to uncover common threads and lessons learned across the four i3/EIR evaluations.

We were happy to have a larger turnout than we had hoped for and a rich discussion about the topic. The recording of our virtual symposium is now available here. Below are materials from each presentation.

We look forward to next year!

9A. Unpacking the Logic Model: A Discussion of Mediators and Antecedents of Educational Outcomes from the Investing in Innovation (i3) Program

Symposium: September 9, 1:00-2:00 PM EDT

Section: Social and Emotional Learning in Educational Settings & Academic Learning in Education Settings



Organizer: Katie Lass, Policy & Research Group

Impact on Antecedents of Student Dropout in a Cross-Age Peer Mentoring Program


Katie Lass, Policy & Research Group*; Sarah Walsh, Policy & Research Group; Eric Jenner, Policy & Research Group; and Sherry Barr, Center for Supportive Schools

Supporting Content-Area Learning in Biology and U.S. History: A Randomized Control Trial of Enhanced Units in California and Virginia


Hannah D’Apice, Empirical Education*; Adam Schellinger, Empirical Education; Jenna Zacamy, Empirical Education; Xin Wei, SRI International; and Andrew P. Jaciw, Empirical Education

The Role of Socioemotional Learning in Teacher Induction: A Longitudinal Study of the CREATE Teacher Residency Program


Audra Wingard, Empirical Education*; Andrew P. Jaciw, Empirical Education; Jenna Zacamy, Empirical Education

Uncovering the Black Box: Exploratory Mediation Analysis for a Science Teacher Professional Development Program


Thanh Nguyen, Empirical Education*; Andrew P. Jaciw, Empirical Education; and Jenna Zacamy, Empirical Education

Discussant: Anne Wolf, Abt Associates


Going Beyond the NCLB-Era to Reduce Achievement Gaps

We just published on Medium an important article that traces the recent history of education research to show how an unfortunate legacy of NCLB has weakened research methods, as applied to school use of edtech, and rendered the resulting achievement gaps invisible. This article was originally a set of four blog posts by CEO Denis Newman and Chief Scientist Andrew Jaciw. The article shows how the legacy belief that differential subgroup effects (e.g., based on poverty, prior achievement, minority status, or English proficiency) found in experiments are, at best, a secondary exploration has left serious achievement gaps unexamined. And the false belief that only studies based on data collected before program implementation are free of misleading biases has given research a warranted reputation as very slow and costly. Instead, we present a rationale for low-cost, fast-turnaround studies using cloud-based edtech usage data combined with school district administrative data that has already been collected. Working in districts that have already implemented the program lowers the cost to the point that a dozen small studies, each examining subgroup effects, which Jaciw has shown to be relatively unbiased, can be combined to produce generalizable results. These results are what school decision-makers need in order to purchase edtech that works for all their students.

Read the article on Medium here.

Or read the 4-part blog series we posted this past summer.

  1. Ending a Two-Decade Research Legacy

  2. ESSA Evidence Tiers and Potential for Bias

  3. Validating Research that Helps Reduce Achievement Gaps

  4. Putting Many Small Studies Together


Putting Many Small Studies Together

This is the last of a four-part blog series about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how many small studies together can give better evidence for resolving achievement gaps. To read the first three parts, use these links.

1. Ending a Two-Decade Research Legacy

2. ESSA Evidence Tiers and Potential for Bias

3. Validating Research that Helps Reduce Achievement Gaps

The NCLB era of the single big study should be giving way to the analysis of differential impacts for subgroups from multiple studies. This is the information that schools need in order to reduce achievement gaps. Today's technology landscape is ready for this major shift in the research paradigm. The school shutdowns resulting from the COVID-19 pandemic have demonstrated that the value of edtech products goes beyond just the cost reduction of eliminating expensive print materials. Over the last decade, digital learning products have collected usage data that provides rich and systematic evidence of how products are being used and by whom. At the same time, schools have accumulated huge databases of digital records on demographics and achievement history, with public data at a granularity down to the grade level. Using today's "big data" analytics, this wealth of information can be put to work for a radical reduction in the cost of showing efficacy.

Fast-turnaround, low-cost research studies will enable hundreds of studies to be conducted, providing information that answers school decision-makers' questions. Their questions are not just "which program, on average, produces the largest effect?" Their questions are "which program is most likely to work in my district, with my kids and teachers, and with my available resources, and which is most likely to reduce the gaps of greatest concern?"

Meta-analysis is a method for combining multiple studies to increase generalizability (Shadish, Cook, & Campbell, 2002). With meta-analysis, we can test for the stability of effects across sites and synthesize those results, where warranted, based on specific statistical criteria. While moderator analysis was considered merely exploratory in the NCLB era, with meta-analysis, moderator results from multiple small studies can, in combination, provide confirmation of a differential impact. Meta-analysis, and other approaches to research synthesis, combined with big data, present new opportunities to move beyond the NCLB-era philosophy that prizes the single big study as proof of a program's efficacy.
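To make the mechanics concrete, here is a minimal sketch of how differential subgroup (moderator) effects from several small studies could be pooled with an inverse-variance random-effects meta-analysis, using the DerSimonian-Laird estimator of between-study variance. The effect sizes and standard errors below are invented for illustration only; they are not from any of the studies discussed here.

```python
import math

# Hypothetical moderator (subgroup-difference) effects from five small
# studies, as (effect size, standard error) pairs. Invented for illustration.
studies = [(0.12, 0.08), (0.20, 0.10), (0.05, 0.09), (0.15, 0.07), (0.10, 0.11)]

def random_effects_meta(studies):
    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1 / se**2 for _, se in studies]
    es = [d for d, _ in studies]
    fixed = sum(wi * di for wi, di in zip(w, es)) / sum(w)
    # Cochran's Q and the DerSimonian-Laird estimate of tau^2
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, es))
    df = len(studies) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights fold tau^2 into each study's variance
    w_re = [1 / (se**2 + tau2) for _, se in studies]
    pooled = sum(wi * di for wi, di in zip(w_re, es)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    return pooled, se_pooled, tau2

pooled, se, tau2 = random_effects_meta(studies)
print(f"pooled effect = {pooled:.3f} (SE {se:.3f}), tau^2 = {tau2:.3f}")
```

The key point for generalization is the pooled standard error: each study alone is too noisy to confirm a differential impact, but the combined estimate is far more precise, and tau^2 flags whether the moderator effect is stable across sites.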

While addressing WWC and ESSA standards, we caution that a single study in one school district, or even several studies in several school districts, may not provide enough useful information to generalize to other school districts. For research to be most effective, we need studies in enough districts to represent the full diversity of relevant populations, and those studies need to systematically include moderator analyses so that impacts can be generalized for subgroups.

The definitions provided in ESSA do not address how much information is needed to generalize from a particular study to implementation in other school districts. While we accept that well-designed Tier 2 or 3 studies are necessary to establish an appropriate level of rigor, we do not believe a single study is sufficient to declare that a program will be effective across varied populations. We note that the Standards for Excellence in Education Research (SEER), recently adopted by IES, call for facilitating generalizability.

After almost two decades of exclusive focus on the design of the single study, we need to address achievement gaps more effectively, with the specifics that school decision-makers need. Lowering the cost and turnaround time of research studies that break out subgroup results is entirely feasible. With enough studies qualified for meta-analysis, a new wealth of information will be available to educators who want to select the products that will best serve their students. This new order will democratize learning across the country, reducing inequities and raising student achievement in K-12 schools.