blog posts and news stories

Evidentally Rejoins Empirical Education

On 10/31/18, we formed Evidentally from a set of projects that Empirical Education called Evidence as a Service (EaaS). The idea was to build a set of products that would automate much of the labor-intensive portions of the research process, e.g., statistical analysis, data cleaning, and similar efforts. By building on education technology (edtech) efficiencies, particularly the collection of edtech usage data, these products would make it possible for non-researchers such as school administrators to conduct efficacy research.

By lowering the cost of research and simplifying the research process, we could increase the number of valid studies available to be combined in a meta-analysis for generalizable results. The notion was that, as a product company, Evidentally could attempt to attract investments unavailable to services companies such as Empirical Education. Unfortunately, for several reasons, Evidentally was unable to secure that investment.

The intellectual property and projects of Evidentally were returned to Empirical Education as of 12/7/22, and Evidentally (as its own entity) was dissolved. While the team is still committed to building the Evidence as a Service suite of tools, the work will be conducted as a project of Empirical Education, under the branding of the Evidentally Evidence Suite. The Evidentally product is just one piece of Empirical Education’s education evidence offerings for education applications and curricula that need evidence of efficacy meeting any of the Every Student Succeeds Act (ESSA) tiers of evidence.


Happy New Year from Empirical Education

To ring in the new year, we want to share this two-minute video with you. It comprises highlights from 2022 from each person on our team. We hope you like it. Cheers to a healthy and prosperous 2023!

My colleagues appear in this order in the video.

Happy New Year photo by Sincerely Media


Studying the Impacts of CAPIT Reading: An Early Literacy Program in Oklahoma

Empirical Education’s Evidentally recently conducted a study to evaluate the impact of CAPIT Reading on student early literacy achievement. The study utilized a quasi-experimental comparison group design using data from 12 elementary schools in a suburban school district in Oklahoma during the 2019–20 school year.

CAPIT Reading is a comprehensive PK–2 literacy solution that includes a digital phonics curriculum and teacher professional development. The program is a teacher-led phonemic awareness and phonics curriculum that includes lesson plans, built-in assessments, and ongoing support.

Four schools used CAPIT to supplement their literacy instruction for kindergarten students (treatment group) while eight schools did not (comparison group). The study linked CAPIT usage data and district demographic and achievement data to estimate the impact of CAPIT on the Letter Word Sounds Fluency (LWSF) and Early Literacy Composite scores of the aimsweb reading assessment, administered by the district in August and January.

We found a positive impact of CAPIT Reading on student early reading achievement on the aimsweb assessment for kindergarten students. This positive impact was estimated at 4.4 test score points for the aimsweb Early Literacy Composite score (effect size = 0.17; p = 0.01) and 7.8 points for the LWSF score (effect size = 0.29; p < 0.001). This impact on the LWSF score is equivalent to a 29% increase in growth for the average CAPIT student from the fall to winter tests.
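As a back-of-the-envelope check on how these numbers relate, an effect size is the raw impact divided by the outcome’s standard deviation, so the reported pairs of figures imply the SD of the outcome. The sketch below is our own arithmetic under that assumption, not a calculation taken from the report.

```python
# Illustrative arithmetic (our own inference, not from the report):
# effect size = raw impact in points / standard deviation of the outcome.
def effect_size(impact_points, sd_points):
    return impact_points / sd_points

# An impact of 7.8 LWSF points at a reported effect size of 0.29
# implies an outcome SD of roughly 7.8 / 0.29, i.e., about 27 points.
implied_sd = 7.8 / 0.29

print(round(effect_size(7.8, implied_sd), 2))  # → 0.29
```

The same logic applies to the composite score: 4.4 points at effect size 0.17 implies an SD of roughly 26 points on that scale.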

We found limited evidence of differential impact across student subgroups, meaning that this positive impact for CAPIT users generally did not vary with student characteristics such as eligibility for free and reduced-price lunch, race, or gender. We found that the impact on aimsweb overall was marginally greater for special education students by 4.9 points (p = 0.09) and that the impact on LWSF scores was greater for English Language Learners by 7.4 points (p = 0.09). The impact of CAPIT Reading did not vary significantly across other student groups.

Read the CAPIT Reading Student Impact Report for more information on this early literacy research.


SREE 2022 Annual Meeting

When I read the theme of the 2022 SREE Conference, “Reckoning to Racial Justice: Centering Underserved Communities in Research on Educational Effectiveness”, I was eager to learn more about the important work happening in our community. The conference made it clear that SREE researchers are becoming increasingly aware of the need to swap individual-level variables for system-level variables that better characterize issues of systematic access and privilege. I was also excited that many SREE researchers are pulling from the fields of mixed methods and critical race theory to foster more equity-aligned study designs, such as those that center participant voice and elevate counter-narratives.

I’m excited to share a few highlights from each day of the conference.

Wednesday, September 21, 2022

Dr. Kamilah B. Legette, University of Denver

Dr. Kamilah B Legette masked and presenting at SREE

Dr. Kamilah B. Legette from the University of Denver discussed their research exploring the relationship between a student’s race and teacher perceptions of the student’s behavior as a) severe, b) inappropriate, and c) indicative of patterned behavior. In their study, 22 teachers were asked to read vignettes describing non-compliant student behaviors (e.g., disrupting storytime) where student identity was varied by using names that are stereotypically gendered and Black (e.g., Jazmine, Darnell) or White (e.g., Katie, Cody).

Multilevel modeling revealed that while student race did not predict teacher perceptions of behavior as severe, inappropriate, or patterned, students’ race was a moderator of the strength of the relationship between teachers’ emotions and perceptions of severe and patterned behavior. Specifically, the relationship between feelings of frustration and severe behavior was stronger for Black children than for White children, and the relationship between feelings of anger and patterned behavior showed the same pattern. Dr. Legette’s work highlighted a need for teachers to engage in reflective practices to unpack these biases.

Dr. Johari Harris, University of Virginia

In the same session, Dr. Johari Harris from the University of Virginia shared their work with the Children’s Defense Fund Freedom Schools. Learning for All (LFA), one Freedom School for students in grades 3-5, offers a five-week virtual summer literacy program with a culturally responsive curriculum based on developmental science. The program aims to create humanizing spaces that (re)define and (re)affirm Black students’ racial-ethnic identities, while also increasing students’ literacy skills, motivation, and engagement.

Dr. Harris’s mixed methods research found that students felt LFA promoted equity and inclusion, and reported greater participation, relevance, and enjoyment within LFA compared to in-person learning environments prior to COVID-19. They also felt their teachers were culturally engaging, and reported a greater sense of belonging, desire to learn, and enjoyment.

While it’s often assumed that young children of color are not fully aware of their racial-ethnic identity or how it is situated within a White supremacist society, Dr. Harris’s work demonstrated the importance of offering culturally affirming spaces to upper-elementary aged students.

Thursday, September 22, 2022

Dr. Krystal Thomas, SRI

Dr. Krystal Thomas presenting at SREE

On Thursday, I attended a talk by Dr. Krystal Thomas from SRI International about the potential of open education resource (OER) programming to further culturally responsive and sustaining practices (CRSP). Their team developed a rubric to analyze OER programming, including materials and professional development (PD) opportunities. The rubric combined principles of OER (free and open access to materials, student-generated knowledge) and CRSP (critical consciousness, student agency, student ownership, inclusive content, classroom culture, and high academic standards).

Findings suggest that while OER offers access to quality instructional materials, it does not necessarily develop teacher capacity to employ CRSP. The team also found that some OER developers charge for CRSP PD, which undermines a primary goal of OER (i.e., open access). One opportunity this talk provided was eventual access to a rubric to analyze critical consciousness in program materials and professional learning (Dr. Thomas said these materials will be posted on the SRI website in upcoming months). I believe this rubric may support equity-driven research and evaluation, including Empirical’s evaluation of the antiracist teacher residency program, CREATE (Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness).

Dr. Rekha Balu, Urban Institute; Dr. Sean Reardon, Stanford University; Dr. Beth Boulay, Abt Associates

left to right: Dr. Beth Boulay, Dr. Rekha Balu, Dr. Sean Reardon, and Titilola Harley on stage at SREE

The plenary talk, featuring discussants Dr. Rekha Balu, Dr. Sean Reardon, and Dr. Beth Boulay, offered suggestions for designing equity- and action-driven effectiveness studies. Dr. Balu urged the SREE community to undertake “projects of a lifetime”. These are long-haul initiatives that push for structural change in search of racial justice. Dr. Balu argued that we could move away from typical thinking about race as a “control variable”, towards thinking about race as an experience, a system, and a structure.

Dr. Balu noted the necessity of mixed methods and participant-driven approaches to serve this goal. Along these same lines, Dr. Reardon felt we need to consider system-level inputs (e.g., school funding) and system-level outputs (e.g., rate of high school graduation) in order to understand disparities in opportunity, rather than just focusing on individual-level factors (e.g., teacher effectiveness, student GPA, parent involvement) that distract from larger forces of inequity. Dr. Boulay noted the importance of causal evidence to persuade key gatekeepers to pursue equity initiatives and called for more high quality measures to serve that goal.

Friday, September 23, 2022

On Friday, the tone of the conference was one of calling people in (a phrase used in opposition to “calling people out”, which is often ego-driven, alienating, and counterproductive to motivating change).

Dr. Ivory Toldson, Howard University

Dr. Ivory Toldson at a podium presenting at SREE

In the morning, I attended the Keynote Session by Dr. Ivory Toldson from Howard University. What stuck with me from Dr. Toldson’s talk was their argument that we tend to use numbers as a proxy for people in statistical models, but to avoid some of the racism inherent in our profession as researchers, we must see numbers as people. Dr. Toldson urged the audience to use people to understand numbers, not numbers to understand people. In other words, by deriving a statistical outcome, we do not necessarily know more about the people we study. However, we are equipped with a conversation starter. For example, if Dr. Toldson hadn’t invited Black boys to voice their own experience of why they sometimes struggle in school, they may have never drawn a potential link between sleep deprivation and ADHD diagnosis: a huge departure from the traditional deficit narrative surrounding Black boys in school.

Dr. Toldson also challenged us to consider what our choice in the reference group means in real terms. When we use White students as the reference group, we normalize Whiteness and we normalize groups with the most power. This impacts not only the conclusions we draw, but also the larger framework in which we operate (i.e., White = standard, good, normal).

I also appreciated Dr. Toldson’s commentary on the need for “distributive trust” in schools. They questioned why the people furthest from the students (e.g., superintendents, principals) are given the most power to name best practices, rather than empowering teachers to do what they know works best and to report back. This thought led me to wonder, what can we do as researchers to lend power to teachers and students? Not in a performative way, but in a way that improves our research by honoring their beliefs and first-hand experiences; how can we engage them as knowledgeable partners who should be driving the narrative of effectiveness work?

Dr. Deborah Lindo, Dr. Karin Lange, Adam Smith, EF+Math Program; Jenny Bradbury, Digital Promise; Jeanette Franklin, New York City DOE

Later in the day, I attended a session about building research programs on a foundation of equity. Folks from EF+Math Program (Dr. Deborah Lindo, Dr. Karin Lange, and Dr. Adam Smith), Digital Promise (Jenny Bradbury), and the New York City DOE (Jeanette Franklin) introduced us to some ideas for implementing inclusive research, including a) fostering participant ownership of research initiatives; b) valuing participant expertise in research design; c) co-designing research in partnership with communities and participants; d) elevating participant voice, experiential data, and other non-traditional effectiveness data (e.g., “street data”); and e) putting relationships before research design and outcomes. As the panel noted, racism and inequity are products of design and can be redesigned. More equitable research practices can be one way of doing that.

Saturday, September 24, 2022

Dr. Andrew Jaciw, Empirical Education

Dr. Andrew Jaciw at a podium presenting at SREE

On Saturday, I sat in on a session that included a talk given by my colleague Dr. Andrew Jaciw. Instead of relaying my own interpretation of Andrew’s ideas and the values they bring to the SREE community, I’ll just note that he will summarize the ideas and insights from his talk and subsequent discussion in an upcoming blog. Keep your eyes open for that!

See you next year!

Dr. Chelsey Nardi and Dr. Leanne Doughty


Navigating the Tensions: How Could Equity-Relevant Research Also Be Agile, Open, and Scalable?

Our SEERNet partnership with Digital Promise is working to connect platform developers, researchers, and educators to find ways to conduct equity-relevant research using well-used digital learning platforms, and to simultaneously conduct research that is more agile, more open, and more directly applicable at scale. To do this, researchers may have to rethink how they plan and undertake their research. We wrote a paper identifying five approaches that could better support this work.

  1. Researchers could reframe research designs to form smaller, agile cycles that test small changes each time.
  2. Researchers could shift from designing new educational resources to determining how well-used resources could be elaborated and refined to address equity issues.
  3. Researchers could utilize variables that capture student experiences to investigate equity when they cannot obtain student demographic/identity variables.
  4. Researchers could work in partnership with educators on equity problems that educators prioritize and want help in solving.
  5. Researchers could acknowledge that achieving equity is not only a technological or resource-design problem, but requires working at the classroom and systems levels too.

We hope that this paper (Navigating the Tensions: How Could Equity-Relevant Research Also Be Agile, Open, and Scalable?) will provide insights and ideas for researchers in the SEERNet community.

Read the paper here.


Evidentally is a finalist in the XPRIZE Digital Learning Challenge

The XPRIZE Digital Learning Challenge encourages applicants to develop innovative approaches to “modernize, accelerate, and improve effective learning tools, processes and outcomes” for all learners. The overarching goal of this type of research is to increase equity by identifying education products that work with different subgroups of students. Seeing the Institute of Education Sciences (IES) move in this direction provides hope for the future of education research and our students.

Those of you who have known us for the last 5-10 years may be aware that we’ve been working towards this future of low-cost, quick turnaround studies for quite some time. To be completely transparent, I had never even heard of XPRIZE before IES funded one of their competitions.

Given our excitement about this IES-funded competition, we knew we had to throw our Evidentally hat into the ring. Evidentally is the part of Empirical Education—formerly known as Evidence as a Service (EaaS)—that has been producing low-cost, quick turnaround research reports for edtech clients for the past 5 years.

Of the 33 teams who entered the XPRIZE competition, we are excited to announce that we are one of the 10 finalists. We look forward to seeing how this competition helps to pave the road to scalable education research.


AERA 2022 Annual Meeting

We really enjoyed being back in person at AERA for the first time in a few years. We missed the interaction and collaboration that can truly only come from talking and engaging in person. The theme this year—Cultivating Equitable Education Systems for the 21st Century—is highly aligned with our company goals, and we value and appreciate the extraordinary work of so many researchers and practitioners who are dedicated to discovering equitable educational solutions.

We met some of the team from ICPSR, the Inter-university Consortium for Political and Social Research, and had a chance to learn about their guide to social science data preparation and archiving. We attended too many presentations to talk about so we’ll highlight a few below that stood out to us.

Thursday, April 21, 2022

On Thursday, Sze-Shun Lau and Jenna Zacamy presented the impacts of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) on the continuous retention of teachers through their second year. The presentation was part of a roundtable discussion with Jacob Elmore, Dirk Richter, Eric Richter, Christin Lucksnat, and Stefanie Marshall.

It was a pleasure to hear about the work coming out of the University of Potsdam around examining the connections between extraversion levels of alternatively-certified teachers and their job satisfaction and student achievement, and about opportunities for early-career teachers at the University of Minnesota to be part of learning communities with whom they can openly discuss racialized matters in school settings and develop their racial consciousness. We also had the opportunity to engage in conversation with our fellow presenters about constructive supports for early-career teachers that place value on the experiences and motivating factors they bring to the table, and other commonalities in our work aiming to increase retention of teachers in diverse contexts.

Friday, April 22, 2022

Our favorite session on Friday was a paper session titled Critical and Transformative Perspectives on Social and Emotional Learning. In this session, Dr. Dena Simmons shared her paper titled Educator Freedom Dreams: Humanizing Social and Emotional Learning Through Racial Justice and talked about SEL as an approach to alleviating the stressors of systemic racism from a Critical Race Theory education perspective.

We tweeted about it from AERA.

Another interesting session from Friday was about the future of IES research. Jenna sat in on a small group discussion around the proposed future topic areas of IES competitions. We are most interested in whether and how IES will implement the recommendation to have a “systematic, periodic, and transparent process for analyzing the state of the field and adding or removing topics as appropriate”.

Saturday, April 23, 2022

On Saturday morning, there was a symposium titled Revolutionary Love: The Role of Antiracism in Affirming the Literacies of Black and Latinx Elementary Youth. The speakers talked about the three tenets of providing thick, revolutionary love to students: believing, knowing, and doing.

speakers from the Revolutionary Love symposium

Saturday afternoon, in a presidential session titled Beyond Stopping Hate: Cultivating Safe, Equitable and Affirming Educational Spaces for Asian/Asian American Students, we heard CSU Assistant Professor Edward Curammeng give crucial advice to researchers: “We need to read outside our fields, we need to re-read what we think we’ve already read, and we need to engage Asian American voices in our research.”

After our weekend at AERA, we returned home refreshed and thinking about the importance of making sure students and teachers see themselves in their school contexts. Dr. Simmons provided a crucial reminder that remaining neutral and failing to integrate the sociopolitical contexts of educational issues only furthers erasure. As our evaluation of CREATE continues, we plan to incorporate some of the great feedback we received at our roundtable session, including further exploring the motivations that led our study participants to enter the teaching profession, and how their work with CREATE adds fuel to those motivations.

Did you attend the annual AERA meeting this year? Tell us about your favorite session or something that got you thinking.


McGraw Hill Education ALEKS Study Published

We worked with McGraw Hill in 2021 to evaluate the effect of ALEKS, an adaptive program for math and science, in California and Arizona. These School Impact reports, like all of our reports, were designed to meet the Every Student Succeeds Act (ESSA) evidence standards.

During this process of working with McGraw Hill, we found evidence that the implementation of ALEKS in Arizona school districts during the 2018-2019 school year had a positive effect on the AzMERIT End of Course Algebra I and Algebra II assessments, especially for students from historically disadvantaged populations. This School Impact report—meeting ESSA evidence tier 3: Promising Evidence—identifies the school-level effects of active ALEKS usage on achievement compared to similar AZ schools not using ALEKS.

Please visit our McGraw Hill webpage to read the ALEKS Arizona School Impact report.

What is ESSA?

For more information on ESSA in education and how Empirical Education incorporates the ESSA standards into our work, check out our ESSA page.


Presenting CREATE at AERA in April 2022

Attending AERA 2022

We’re finally returning to in-person conferences after the COVID-related hiatus, and we will be presenting at the annual meeting of the American Educational Research Association (AERA). This year, the AERA meeting will be held in San Diego, our CEO Denis Newman’s new home turf since relocating to Encinitas from Palo Alto, CA.

Sze-Shun Lau and Jenna Zacamy will be attending AERA in person, joined by Andrew Jaciw virtually, to present impacts of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) on the continuous retention of teachers through their second year.

  • When: Thursday, April 21, from 2:30 to 4:00pm PDT
  • Where: San Diego Convention Center, Exhibit Hall B
  • AERA Roundtable session: Retaining Teachers for Diverse Contexts
  • AERA Presentation: Impacts of “Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness” on the Continuous Retention of Teachers Through Their Second Year

Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE)

The work that is the basis of our AERA presentation examines the impact of CREATE—a teacher induction program—on graduation and subsequent retention of teachers through their first two years. The matched comparison group design involved 121 teachers across two cohorts. Positive impacts on retention rates were observed among Black educators only.

Retention rates after two years of teaching were 71% for non-Black educators in both CREATE and comparison groups. For Black educators the rates were 96% and 63% in CREATE and comparison, respectively. Positive impacts on mediators among Black educators, including stress-management and self-efficacy in teaching, provide a preliminary explanation of effects.

We have been exploring potential mechanisms for these impacts by posing open-ended survey questions to teachers about teacher retention. Based on their own conversations, experiences, and observations, early-career teachers cited rigid teaching standards, heavy and mentally taxing workloads, a lack of support from administration, and low pay as common reasons teachers in their first three years leave the profession.

Factors that these teachers see as effective in retaining early-career teachers include recognition of the importance of representation in the classroom and motivation to work towards building less oppressive systems for their students. Early-career teachers participating in CREATE credited access to professional learning around communication skills, changing one’s mindset, and addressing inequities as potential drivers of higher retention rates.

We look forward to presenting these and other themes that have emerged from the responses these teachers provided.

We would be delighted to see you in San Diego if you’re planning to attend AERA. Let us know if we can schedule a time to meet up.

Photo by Lucas Davies


Towards Greater (Local) Relevance of Causal Generalizations

To cite the paper we discuss in this blog post, use the reference below.

Jaciw, A. P., Unlu, F., & Nguyen, T. (2021). A within-study approach to evaluating the role of moderators of impact in limiting generalizations from “large to small”. American Journal of Evaluation.

Generalizability of Causal Inferences

The field of education has made much progress over the past 20 years in the use of rigorous methods, such as randomized experiments, for evaluating causal impacts of programs. This includes a growing number of studies on the generalizability of causal inferences stemming from the recognition of the prevalence of impact heterogeneity and its sources (Steiner et al., 2019). Most recent work on generalizability of causal inferences has focused on inferences from “small to large”. Studies typically include 30–70 schools, while generalizations are made to inference populations at least ten times larger (Tipton et al., 2017). Such studies typically inform decision makers concerned with impacts on broad scales, for example at the state level.

However, as we are periodically reminded by the likes of Cronbach (1975, 1982) and Shadish et al. (2002), generalizations are of many types and support decisions on different levels. Causal inferences may be generalized not only to populations outside the study sample or to larger populations, but also to subgroups within the study sample and to smaller groups, even down to the individual. In practice, district and school officials who need local interpretations of the evidence might ask: “If a school reform effort demonstrates positive impact on some large scale, should I, as a principal, expect that the reform will have a positive impact on the students in my school?” Our work introduces a new approach (or a new application of an old approach) to address questions of this type. We empirically evaluate how well causal inferences drawn on the large scale generalize to smaller scales.

The Research Method

We adapt a method from studies traditionally used (first in economics and then in education) to empirically measure the accuracy of program impact estimates from non-experiments. A central question is whether specific strategies result in better alignment between non-experimental impact findings and experimental benchmarks. Those studies—sometimes referred to as “Within-Study Comparison” studies (pioneered by LaLonde, 1986, and Fraker et al., 1987)—typically start with an estimate of a program’s impact from an uncompromised experiment. This result serves as the benchmark experimental impact finding. Then, to generate a non-experimental result, outcomes from the experimental control group are replaced with those from a different comparison group. The difference in impact that results from this substitution measures the bias (inaccuracy) in the result that employs the non-experimental comparison. Researchers typically summarize this bias and then try to remediate it using various design- and analysis-based strategies. (The Within-Study Comparison literature is vast and includes many studies that we cite in the article.)
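The core Within-Study Comparison calculation can be sketched in a few lines. The simulated data below use made-up numbers purely to illustrate the logic of benchmarking a non-experimental estimate against an experimental one; nothing here comes from an actual study.

```python
import random

random.seed(0)

# Simulated outcomes (hypothetical numbers). The non-experimental
# comparison group is drawn from a slightly different population,
# building in a selection bias of about 0.10 by construction.
treatment    = [random.gauss(0.25, 1.0) for _ in range(2000)]
rct_control  = [random.gauss(0.00, 1.0) for _ in range(2000)]
nonexp_group = [random.gauss(-0.10, 1.0) for _ in range(2000)]

def mean(xs):
    return sum(xs) / len(xs)

benchmark_impact = mean(treatment) - mean(rct_control)   # experimental
nonexp_impact    = mean(treatment) - mean(nonexp_group)  # non-experimental
bias = nonexp_impact - benchmark_impact                  # ≈ 0.10 here
```

The substitution of the comparison group for the randomized controls is the only change, so the difference between the two impact estimates isolates the bias introduced by the non-experimental design.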

Our Approach Follows a Within-Study Comparison Rationale and Method, but with a Focus on Generalizability.

We use data from the multisite Tennessee Student-Teacher Achievement Ratio (STAR) class size reduction experiment (described in Finn et al., 1990; Mosteller, 1995; Nye et al., 2000) to illustrate the application of our method. (We used 73 of the original 79 sites.) In the original study, students and teachers were randomized to small or regular-sized classes in grades K–3. Results showed a positive average impact of small classes. In our study, we ask whether a decision maker at a given site should accept this finding of an overall average positive impact as generalizable to their individual site.

We use the Within-Study Comparison Method as a Foundation.

First, we adopt the idea of using experimental benchmark impacts as the starting point. In the case of the STAR trial, each of the 73 individual sites yields its own benchmark value for impact. Second, consistent with Within-Study Comparisons, we select an alternative to compare against the benchmark. Specifically, we choose the average of impacts (the grand mean) across all sites as the generalized value. Third, we establish how closely this generalized value approximates impacts at individual sites (i.e., how well it generalizes “to the small”). With STAR, we can do this 73 times, once for each site. Fourth, we summarize the discrepancies. Standard Within-Study Comparison methods typically average over the absolute values of individual biases. We adapt this, but instead use the average of the 73 squared differences between the generalized impact and the site-benchmark impacts. This allows us to capture the average discrepancy as a variance, specifically as the variation in impact across sites. We estimated this variation several ways, using alternative hierarchical linear models. Finally, we examine whether adjusting for imbalance between sites in site-level characteristics that potentially interact with treatment leads to closer alignment between the grand mean (generalized) and site-specific impacts.

Sometimes people wonder why, with Within-Study Comparison studies, one would use less-optimal comparison-group-based alternatives when site-specific benchmark impacts are available. The whole point is to see how closely we can replicate the benchmark quantity, in order to learn how well methods of causal inference (of generalization, in this case) are likely to perform in situations where we do not have an experimental benchmark.
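Steps two through four can be sketched with simulated site impacts (all numbers invented for illustration): the generalized value is the grand mean, and the average squared discrepancy between it and the site benchmarks is exactly a variance.

```python
import random

random.seed(1)

# Hypothetical benchmark impacts for 73 sites, varying around a grand
# mean of roughly 0.25 SD (parameters invented for illustration).
site_impacts = [random.gauss(0.25, 0.15) for _ in range(73)]

# Step two: the generalized value is the grand mean across sites.
grand_mean = sum(site_impacts) / len(site_impacts)

# Steps three and four: average the 73 squared discrepancies between
# the generalized impact and each site's benchmark -- this quantity is
# the variance of impacts across sites.
mean_sq_discrepancy = sum(
    (impact - grand_mean) ** 2 for impact in site_impacts
) / len(site_impacts)

typical_miss = mean_sq_discrepancy ** 0.5  # in SD units of the outcome
```

In the paper itself this variance is estimated with hierarchical linear models rather than a raw average, but the interpretation is the same: the square root of the mean squared discrepancy is the typical amount by which the generalized impact misses a site’s benchmark.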

Our application is intentionally based on Within-Study Comparison methods, as set out clearly in Jaciw (2010, 2016). Early applications with a similar approach can be found in Hotz et al. (2005) and Hotz et al. (2006). A new contribution of ours is that we summarize the discrepancy not as an average of the absolute value of bias (a common metric in Within-Study Comparison studies) but, as noted above, as a variance. This may sound like a nuanced technical detail, but we think it leads to an important interpretation: variation in impact is not just related to the problem of generalizability; rather, it directly indexes the accuracy (quantifies the degree of validity) of generalizations from “large to small”. We acknowledge Bloom et al. (2005) for the impetus for this idea, specifically their insight that bias in Within-Study Comparison studies can be thought of as a type of “mismatch error”. Finally, we think it is important to acknowledge the ideas in G Theory from education (Cronbach et al., 1963; Shavelson et al., 2009). In that tradition, parsing variability in outcomes, accounting for its sources, and assessing the role of interactions among study factors are central to the problem of generalizability.

Research Findings

First main result

The grand mean impact, on average, does not generalize reliably to the 73 sites. Before covariate adjustments, the average of the differences between the grand mean and the impacts at individual sites ranges from 0.25 to 0.41 standard deviations (SDs) of the outcome distribution, depending on the model used. After covariate adjustments, the average of the differences ranges from 0.17 to 0.41 SDs. (The average impact was about 0.25 SD.)

Second main result

Modeling effects of site-level covariates, and their interactions with treatment, only minimally reduced the between-site differences in impact.

Third main result

Whether impact heterogeneity achieves statistical significance depends on sampling error and on correctly accounting for its sources. If we are going to provide accurate policy advice, we must make sure that we are not mistaking random sampling error within sites (differences we would expect in results even if the program were not used) for variation in impact across sites. One important but easily overlooked source of random sampling error comes from classes. Given that teachers provide different value-added to students’ learning, we can expect differences in outcomes across classes. In STAR, with only a handful of teachers per school, the between-class differences easily add noise to the between-school outcomes and impacts. After adjusting for class random effects, the discrepancies in impact described above decreased by approximately 40%.
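The logic of this result can be illustrated with a small simulation. The sketch below is hypothetical — simulated numbers rather than the STAR data, and a simple method-of-moments correction rather than our actual random-effects models — but it shows the mechanism: when each site's impact estimate carries within-site sampling noise, the raw variance of the estimates overstates true between-site heterogeneity, and removing the average sampling variance recovers a figure closer to the truth:

```python
import random
from statistics import pvariance

random.seed(11)

# Hypothetical setup: 73 sites whose true impacts vary around 0.25 SD,
# observed only through noisy within-site estimates.
n_sites = 73
true_tau2 = 0.01      # true between-site variance in impact (SD = 0.1)
sampling_var = 0.02   # within-site sampling variance of each estimate

true_impacts = [random.gauss(0.25, true_tau2 ** 0.5) for _ in range(n_sites)]
estimates = [tau + random.gauss(0.0, sampling_var ** 0.5)
             for tau in true_impacts]

# Naive heterogeneity: the raw variance of the site estimates mixes true
# variation in impact with within-site sampling noise.
naive_tau2 = pvariance(estimates)

# Method-of-moments correction: subtract the (here, known) average
# sampling variance -- analogous to letting a hierarchical model absorb
# class- and student-level noise before attributing variance to sites.
corrected_tau2 = max(naive_tau2 - sampling_var, 0.0)

print(f"naive between-site variance:     {naive_tau2:.4f}")
print(f"corrected between-site variance: {corrected_tau2:.4f}")
```

A hierarchical linear model with class and student random effects accomplishes this separation in a principled way; here the sampling variance is simply assumed known to keep the illustration transparent.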

Research Conclusions

For the STAR experiment, the grand mean impact failed to generalize to individual sites, and adjusting for effects of moderators did not help much. Adjusting for class-level sampling error substantially reduced the level of measured heterogeneity. Even so, the discrepancies that remained were large enough to be substantively important, and therefore we cannot conclude that the average impact generalized to individual sites.

In sum, based on this study, a policymaker at the site (school) level should exercise caution in assessing whether the average result applies to their unique context.

The results remind us of an observation from Lee Cronbach (1982) about how a school board might best draw inferences about their local context serving a large Hispanic student body when program effects vary:

The school board might therefore do better to look at…small cities, cities with a large Hispanic minority, cities with well-trained teachers, and so on. Several interpretations-by-analogy can then be made….If these several conclusions are not too discordant, the board can have some confidence in the decision that it makes about its small city with well-trained teachers and a Hispanic clientele. When results in the various slices of data are dissimilar, it is better to try to understand the variation than to take the well-determined – but only remotely relevant – national average as the best available information. The school board cannot regard that average as superior information unless it believes that district characteristics do not matter (p. 167).

Some Possible Extensions of The Work

We’re looking forward to doing more work to continue to understand how to produce useful generalizations that support decision-making on smaller scales. Traditional Within-Study Comparison studies give us much food for thought, including about other designs and analysis strategies for inferring impacts at individual sites, and about how best to communicate the discrepancies we observe and whether they are substantively large enough to matter for informing policy decisions and outcomes. One main area of interest concerns the quality of the moderators themselves; that is, how well they account for or explain impact heterogeneity. Here our approach diverges from traditional Within-Study Comparison studies. When applied to problems of internal validity, confounders can be seen as nuisances that make our impact estimates inaccurate. With regard to external validity, factors that interact with the treatment, and thereby produce variation in impact that affects generalizability, are not a nuisance; rather, they are an important source of information that may help us understand the mechanisms through which the variation in impact arises. Therefore, understanding the mechanisms relating the person, the program, the context, and the outcome is key.

Lee Cronbach described the bounty of and interrelations among interactions in the social sciences as a “hall of mirrors”. We’re looking forward to continuing the careful journey along that hall to incrementally make sense of a complex world!


References

Bloom, H. S., Michalopoulos, C., & Hill, C. J. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In H. S. Bloom (Ed.), Learning more from social experiments (pp. 173–235). Russell Sage Foundation.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30(2), 116–127.

Cronbach, L. J. (1982). Designing Evaluations of Educational and Social Programs. Jossey-Bass.

Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. The British Journal of Statistical Psychology, 16, 137–163.

Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557–577.

Fraker, T., & Maynard, R. (1987). The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources, 22, 194–227.

Hotz, V. J., Imbens, G. W., & Klerman, J. A. (2006). Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN program. Journal of Labor Economics, 24, 521–566.

Hotz, V. J., Imbens, G. W., & Mortimer, J. H. (2005). Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125, 241–270.

Jaciw, A. P. (2010). Challenges to drawing generalized causal inferences in educational research: Methodological and philosophical considerations [Doctoral dissertation, Stanford University].

Jaciw, A. P. (2016). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40, 199–240.

Jaciw, A. P., Unlu, F., & Nguyen, T. (2021). A within-study approach to evaluating the role of moderators of impact in limiting generalizations from “large to small”. American Journal of Evaluation.

LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76, 604–620.

Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5, 113–127.

Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123–151.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin.

Shavelson, R. J., & Webb, N. M. (2009). Generalizability theory and its contributions to the discussion of the generalizability of research findings. In K. Ercikan & W. M. Roth (Eds.), Generalizing from educational research (pp. 13–32). Routledge.

Steiner, P. M., Wong, V. C., & Anglin, K. (2019). A causal replication framework for designing and assessing replication efforts. Zeitschrift für Psychologie, 227, 280–292.

Tipton, E., Hallberg, K., Hedges, L. V., & Chan, W. (2017). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5), 472–505.

Photo by drmakete lab