blog posts and news stories

Introducing Our Newest Researchers

The Empirical Research Team is pleased to announce the addition of two new team members. We welcome Zahava Heydel and Chelsey Nardi on board as our newest researchers!

Zahava Heydel, Research Assistant

Zahava has taken on assisting Sze-Shun Lau with the CREATE project, a teacher residency program in Atlanta Public Schools invested in expanding equity in education by developing critically conscious, compassionate, and skilled educators.  Zahava’s experience as a research assistant at the University of Colorado Anschutz Medical Campus Department of Psychiatry, Colorado Center for Women’s Behavioral Health and Wellness is an asset to the Empirical Education team as we move toward evaluating SEL programs and individual student needs.

Chelsey Nardi, Research Manager

Chelsey is taking on the role of co-project manager for our evaluation of the CREATE project, working with Sze-Shun and Zahava. Chelsey is currently working toward her PhD exploring the application of antiracist theories in science education, which may support the evaluation of CREATE’s mission to develop critically conscious educators. Additionally, her research experience at McREL International and REL Pacific as a Research and Evaluation Associate has prepared her for managing some of our REL Southwest applied research projects. These experiences, coupled with her experience in project management, make her an ideal fit for our team.


Empirical Education Wraps Up Two Major i3 Research Studies

Empirical Education is excited to share that we recently completed two Investing in Innovation (i3) (now EIR) evaluations, one for the Making Sense of SCIENCE program and one for the Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) program. We thank the staff on both programs for their fantastic partnership. We also acknowledge Anne Wolf, our i3 technical assistance liaison from Abt Associates, as well as our Technical Working Group members on the Making Sense of SCIENCE project (Anne Chamberlain, Angela DeBarger, Heather Hill, Ellen Kisker, James Pellegrino, Rich Shavelson, Guillermo Solano-Flores, Steve Schneider, Jessaca Spybrook, and Fatih Unlu) for their invaluable contributions. Conducting these two large-scale, complex, multi-year evaluations over the last five years has not only given us the opportunity to learn much about both programs, but has also challenged our thinking, allowing us to grow as evaluators and researchers. We now reflect on some of the key lessons we learned, lessons that we hope will contribute to the field’s efforts in moving large-scale evaluations forward.

Background on Both Programs and Study Summaries

Making Sense of SCIENCE (developed by WestEd) is a teacher professional learning model aimed at increasing student achievement through improving instruction and supporting districts, schools, and teachers in their implementation of the Next Generation Science Standards (NGSS). The key components of the model include building leadership capacity and providing teacher professional learning. The program’s theory of action is based on the premise that professional learning that is situated in an environment of collaborative inquiry and supported by school and district leadership produces a cascade of effects on teachers’ content and pedagogical content knowledge, teachers’ attitudes and beliefs, the school climate, and students’ opportunities to learn. These effects, in turn, yield improvements in student achievement and other non-academic outcomes (e.g., enjoyment of science, self-efficacy, and agency in science learning). NGSS had just been introduced two years prior to the study, a study which ran from 2015 through 2018. The infancy of NGSS and the resulting shifting landscape of science education posed a significant challenge to our study, which we discuss below.

Our impact study of Making Sense of SCIENCE was a cluster-randomized, two-year evaluation involving more than 300 teachers and 8,000 students. Confirmatory impact analyses found a positive and statistically significant impact on teacher content knowledge. While impact results on student achievement were mostly positive, none reached statistical significance. Exploratory analyses found positive impacts on teacher self-reports of time spent on science instruction, shifts in instructional practices, and amount of peer collaboration. Read our final report here.

CREATE is a three-year teacher residency program for students of Georgia State University College of Education and Human Development (GSU CEHD) that begins in their last year at GSU and continues through their first two years of teaching. The program seeks to raise student achievement by increasing teacher effectiveness and retention of both new and veteran educators by developing critically-conscious, compassionate, and skilled educators who are committed to teaching practices that prioritize racial justice and interrupt inequities.

Our impact study of CREATE used a quasi-experimental design to evaluate program effects for two staggered cohorts of study participants (CREATE and comparison early career teachers) from their final year at GSU CEHD through their second year of teaching, starting with the first cohort in 2015–16. Confirmatory impact analyses found no impact on teacher performance or on student achievement. However, exploratory analyses revealed a positive and statistically significant impact on continuous retention over a three-year time period (spanning graduation from GSU CEHD, entering teaching, and retention into the second year of teaching) for the CREATE group, compared to the comparison group. We also observed that higher continuous retention among Black educators in CREATE, relative to those in the comparison group, is the main driver of the favorable impact. The fact that the differential impacts on Black educators were positive and statistically significant for measures of executive functioning (resilience) and self-efficacy—and marginally statistically significant for stress management related to teaching—hints at potential mediators of impact on retention and guides future research.

After the i3 program funded this research, Empirical Education, GSU CEHD, and CREATE received two additional grants from the U.S. Department of Education’s Supporting Effective Educator Development (SEED) program for further study of CREATE. We are currently studying our sixth cohort of CREATE residents and will have studied eight cohorts of CREATE residents, five cohorts of experienced educators, and two cohorts of cooperating teachers by the end of the second SEED grant. We are excited to continue our work with the GSU and CREATE teams and to explore the impact of CREATE, especially for retention of Black educators. Read our final report for the i3 evaluation of CREATE here.

Lessons Learned

While there were many lessons learned over the past five years, we’ll highlight two that were particularly challenging and possibly most pertinent to other evaluators.

The first key challenge that both studies faced was the availability of valid and reliable instruments to measure impact. For Making Sense of SCIENCE, a measure of student science achievement that was aligned with NGSS was difficult to identify because of the relative newness of the standards, which emphasized three-dimensional learning (disciplinary core ideas, science and engineering practices, and cross-cutting concepts). This multi-dimensional learning stood in stark contrast to the existing view of science education at the time, which primarily focused on science content. In 2014, one year prior to the start of our study, the National Research Council pointed out that “the assessments that are now in wide use were not designed to meet this vision of science proficiency and cannot readily be retrofitted to do so” (NRC, 2014, p. 12). While state science assessments that existed at the time were valid and reliable, they focused on science content and did not measure the type of three-dimensional learning targeted by NGSS. The NRC also noted that developing new assessments would “present[s] complex conceptual, technical, and practical challenges, including cost and efficiency, obtaining reliable results from new assessment types, and developing complex tasks that are equitable for students across a wide range of demographic characteristics” (NRC, 2014, p. 16).

Given this context, despite the research team’s extensive search for assessments from a variety of sources (including reaching out to state departments of education, university-affiliated assessment centers, and test developers), we could not find an appropriate instrument. Using state assessments was not an option. The states in our study were still in the process of either piloting or field testing assessments that were aligned to NGSS or to state standards based on NGSS. This void of assessments left the evaluation team with no choice but to develop one, independently of the program developer, using established items from multiple sources to address general specifications of NGSS, and relying on the deep content expertise of some members of the research team. Of course, there were some risks associated with this, especially given the lack of opportunity to comprehensively pilot or field test the items in the context of the study. When used operationally, the researcher-developed assessment turned out to be difficult and was not highly discriminating of ability at the low end of the achievement scale, which may have influenced the small effect size we observed. The circumstances around the assessment and the need to improvise a measure lead us to interpret findings related to science achievement of the Making Sense of SCIENCE program with caution.

The CREATE evaluation also faced a measurement challenge. One of the two confirmatory outcomes in the study was teacher performance, as measured by ratings of teachers by school administrators on two of the state’s Teacher Assessment on Performance Standards (TAPS), which is a component of the state’s evaluation system (Georgia Department of Education, 2021). We could not detect impact on this measure because the variance observed in the ordinal ratings was remarkably low, with ratings overwhelmingly centered on the median value. This was not a complete surprise. The literature documents this lack of variability in teaching performance ratings. A seminal report, The Widget Effect by The New Teacher Project (Weisberg et al., 2009), called attention to this “national crisis”: the inability of schools to effectively differentiate among low- and high-performing teachers. The report showed that in districts that use binary evaluation ratings, as well as those that use a broader range of rating options, less than 1% of teachers received a rating of unsatisfactory. In the CREATE study, the median value was chosen overwhelmingly. In a study of teacher performance ratings by Kraft and Gilmour (2017), principals explained that they were more reluctant to give new teachers a rating below proficient because they acknowledged that new teachers were still working to improve their teaching, and that “giving a low rating to a potentially good teacher could be counterproductive to a teacher’s development.” These reasons are particularly relevant to the CREATE study given that the teachers in our study are very early in their teaching careers (first-year teachers), and given the high turnover rate of all teachers in Georgia.

We bring up this point about instruments as a way to share with the evaluation community what we see as a not uncommon challenge. In 2018 (the final year of outcomes data collection for Making Sense of SCIENCE), when we presented about the difficulties of finding a valid and reliable NGSS-aligned instrument at AERA, a handful of researchers approached us to commiserate; they too were experiencing similar challenges with finding an established NGSS-aligned instrument. As we write this, perhaps states and testing centers are further along in their development of NGSS-aligned assessments. However, the challenge of finding valid and reliable instruments, generally speaking, will persist as long as educational standards continue to evolve. (And they will.) Our response to this challenge was to be as transparent as possible about the instruments and the conclusions we can draw from using them. In reporting on Making Sense of SCIENCE, we provided detailed descriptions of our process for developing the instruments and reported item- and form-level statistics, as well as contextual information and rationale for critical decisions. In reporting on CREATE, we provided the distribution of ratings on the relevant dimensions of teacher performance for both the baseline and outcome measures. In being transparent, we allow the readers to draw their own conclusions from the data available, facilitate the review of the quality of the evidence against various sets of research standards, support replication of the study, and provide further context for future study.

A second challenge was maintaining a consistent sample over the course of the implementation, particularly in multi-year studies. For Making Sense of SCIENCE, which was conducted over two years, there was substantial teacher mobility into and out of the study. Given the reality of schools, even with study incentives, nearly half of teachers moved out of study schools or study-eligible grades within schools over the two year period of the study. This obviously presented a challenge to program implementation. WestEd delivered professional learning as intended, and leadership professional learning activities all met fidelity thresholds for attendance, with strong uptake of Making Sense of SCIENCE within each year (over 90% of teachers met fidelity thresholds). Yet, only slightly more than half of study teachers met the fidelity threshold for both years. The percentage of teachers leaving the school was congruous with what we observed at the national level: only 84% of teachers stay as a teacher at the same school year-over-year (McFarland et al., 2019). For assessing impacts, the effects of teacher mobility can be addressed to some extent at the analysis stage; however, the more important goal is to figure out ways to achieve fidelity of implementation and exposure for the full program duration. One option is to increase incentivization and try to get more buy-in, including among administration, to allow more teachers to reach the two-year participation targets by retaining teachers in subjects and grades to preserve their eligibility status in the study. This solution may go part way because teacher mobility is a reality. Another option is to adapt the program to make it shorter and more intensive. However, this option may work against the core model of the program’s implementation, which may require time for teachers to assimilate their learning. 
Yet another option is to make the program more adaptable, for example, by letting teachers who leave eligible grades or schools continue to participate remotely, allowing impacts to be assessed over more of the initially randomized sample.

For CREATE, sample size was also a challenge, but for slightly different reasons. During study design and recruitment, we had anticipated and factored the estimated level of attrition into the power analysis, and we successfully recruited the targeted number of teachers. However, several unexpected limitations arose during the study that ultimately resulted in small analytic samples. These limitations included challenges in obtaining research permission from districts and schools (which would have allowed participants to remain active in the study), as well as a loss of study participants due to life changes (e.g., obtaining teaching positions in other states, leaving the teaching profession completely, or feeling like they no longer had the time to complete data collection activities). Also, while Georgia administers the Milestones state assessment in grades 4–8, many participating teachers in both conditions taught lower elementary school grades or non-tested subjects. For the analysis phase, many factors resulted in small student samples: reduced teacher samples, the technical requirement of matching students across conditions within each cohort in order to meet WWC evidence standards, and the need to match students within grades, given the lack of vertically scaled scores. While we did achieve baseline equivalence between the CREATE and comparison groups for the analytic samples, the small number of cases greatly reduced the scope and external validity of the conclusions related to student achievement. The most robust samples were for retention outcomes. We have the most confidence in those results.

As a last point of reflection, we greatly enjoyed and benefited from the close collaboration with our partners on these projects. The research and program teams worked together in lockstep at many stages of the study. We also want to acknowledge the role that the i3 grant played in promoting the collaboration. For example, the grant’s requirements around the development and refinement of the logic model were a major driver of many collaborative efforts. Evaluators reminded the team periodically about the “accountability” requirements, such as ensuring consistency in the definition and use of the program components and mediators in the logic model. The program team, on the other hand, contributed contextual knowledge gained through decades of being intimately involved in the program. In the spirit of participatory evaluation, the two teams benefited from the type of organizational learning that “occurs when cognitive systems and memories are developed and shared by members of the organizations” (Cousins & Earl, 1992). This type of organic and fluid relationship encouraged the researchers and program teams to embrace uncertainty during the study. While we “pre-registered” confirmatory research questions for both studies by submitting the study plans to NEi3 prior to the start of the studies, we allowed exploratory questions to be guided by conversations with the program developers. In doing so, we were able to address questions that were most useful to the program developers and the districts and schools implementing the programs.

We are thankful that we had the opportunity to conduct these two rigorous evaluations alongside such humble, thoughtful, and intentional (among other things!) program teams over the last five years, and we look forward to future collaborations. These two evaluations have both broadened and deepened our experience with large-scale evaluations, and we hope that our reflections here not only serve as lessons for us, but that they may also be useful to the education evaluation community at large, as we continue our work in the complex and dynamic education landscape.


Cousins, J. B., & Earl, L. M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.

Georgia Department of Education (2021). Teacher Keys Effectiveness System.

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.

McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., & Barmer, A. (2019). The Condition of Education 2019 (NCES 2019-144). U.S. Department of Education. National Center for Education Statistics.

National Research Council (NRC). (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. The National Academies Press.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. The New Teacher Project.


Instructional Coaching: Positive Impacts on Edtech Use and Student Learning

In 2019, Digital Promise contracted with Empirical Education to evaluate the impact of the Dynamic Learning Project (DLP) on teacher and student edtech usage and on student achievement. DLP provided school-based instructional technology coaches with mentoring and professional development, with the goal of increasing educational equity and impactful use of technology. You may have seen the blog post we published in summer 2020 announcing the release of our design memo for the study. The importance of this project was magnified during the pandemic-induced shift to an increased use of online tools.

The results of the study are summarized in this research brief published last month. We found evidence of positive impacts on edtech use and student learning across three districts involved in DLP.  

These findings make a contribution to the evidence base for how to drive meaningful technology use in schools. This should continue to be an area of investigation for future studies; districts focused on equity and inclusion must ensure that edtech is adopted broadly across teacher and student populations.


We Won Two SEED Grants in 2020

Empirical Education began conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) in 2015 under a subcontract with Atlanta Neighborhood Charter Schools (ANCS) as part of their Investing in Innovation (i3) Development grant. Then, in 2018, we extended this work with CREATE and Georgia State University through the Supporting Effective Educator Development (SEED) Grant Program. And now, in 2020, we were just notified that BOTH proposals we submitted to the SEED competition to further extend our work with CREATE were awarded grants!

One of the SEED grants is an extension to the one we received in 2018 that will allow us to continue the project for two additional years (through years 4 and 5).

The other SEED award will fund new work with CREATE and Georgia State University by adding cohorts of CREATE residents and conducting a quasi-experiment measuring the effectiveness of CREATE for Cooperating Teachers (that is, the mentor teachers in whose classrooms residents are placed). The study will examine impacts on teacher effectiveness, teacher retention, and student achievement, as well as other mediating outcomes.


SREE 2020 Goes Virtual

We, like many of you, were excited to travel to Washington DC in March 2020 to present at the annual conference of the Society for Research on Educational Effectiveness (SREE). This would have been our 15th year attending or presenting at the SREE conference! We had been looking forward to learning from a variety of sessions and to sharing our own work with the SREE community, so imagine our disappointment when the conference was cancelled (rightfully) in response to the pandemic. Thankfully, SREE offered presenters the option to share their work virtually, and we are excited to have taken part in this opportunity!

Among the several accepted conference proposals, we decided to host the symposium on Social and Emotional Learning in Educational Settings & Academic Learning because it incorporated several of our major projects—three evaluations funded by the Department of Education’s i3/EIR program—two of which focus on teacher professional development and one that focuses on content enhancement routines and student content knowledge. We were joined by Katie Lass who presented on another i3/EIR evaluation conducted by the Policy & Research Group and by Anne Wolf, from Abt Associates, who served as the discussant. The presentations focused on unpacking the logic model for each of the respective programs and collectively, we tried to uncover common threads and lessons learned across the four i3/EIR evaluations.

We were happy to have a turnout that was more than we had hoped for and a rich discussion about the topic. The recording of our virtual symposium is now available here. Below are materials from each presentation.

We look forward to next year!

9A. Unpacking the Logic Model: A Discussion of Mediators and Antecedents of Educational Outcomes from the Investing in Innovation (i3) Program

Symposium: September 9, 1:00-2:00 PM EDT

Section: Social and Emotional Learning in Educational Settings & Academic Learning in Education Settings



Organizer: Katie Lass, Policy & Research Group

Impact on Antecedents of Student Dropout in a Cross-Age Peer Mentoring Program


Katie Lass, Policy & Research Group*; Sarah Walsh, Policy & Research Group; Eric Jenner, Policy & Research Group; and Sherry Barr, Center for Supportive Schools

Supporting Content-Area Learning in Biology and U.S. History: A Randomized Control Trial of Enhanced Units in California and Virginia


Hannah D’Apice, Empirical Education*; Adam Schellinger, Empirical Education; Jenna Zacamy, Empirical Education; Xin Wei, SRI International; and Andrew P. Jaciw, Empirical Education

The Role of Socioemotional Learning in Teacher Induction: A Longitudinal Study of the CREATE Teacher Residency Program


Audra Wingard, Empirical Education*; Andrew P. Jaciw, Empirical Education; Jenna Zacamy, Empirical Education

Uncovering the Black Box: Exploratory Mediation Analysis for a Science Teacher Professional Development Program


Thanh Nguyen, Empirical Education*; Andrew P. Jaciw, Empirical Education; and Jenna Zacamy, Empirical Education

Discussant: Anne Wolf, Abt Associates


Going Beyond the NCLB-Era to Reduce Achievement Gaps

We just published on Medium an important article that traces the recent history of education research to show how an unfortunate legacy of NCLB has weakened research methods, as applied to school use of edtech, and made invisible resulting achievement gaps. This article was originally a set of four blog posts by CEO Denis Newman and Chief Scientist Andrew Jaciw. The article shows how the legacy belief that differential subgroup effects (e.g., based on poverty, prior achievement, minority status, English proficiency) found in experiments are, at best, a secondary exploration has left serious achievement gaps unexamined. And the false belief that only studies based on data collected before program implementation are free of misleading biases has given research the warranted reputation as very slow and costly. Instead, we present a rationale for low-cost and fast-turnaround studies using cloud-based edtech usage data combined with already collected school district administrative data. Working in districts that have already implemented the program lowers the cost to the point that a dozen small studies each examining subgroup effects, which Jaciw has shown to be relatively unbiased, can be combined to produce generalizable results. These results are what school decision-makers need in order to purchase edtech that works for all their students.

Read the article on Medium here.

Or read the 4-part blog series we posted this past summer.

  1. Ending a Two-Decade Research Legacy

  2. ESSA Evidence Tiers and Potential for Bias

  3. Validating Research that Helps Reduce Achievement Gaps

  4. Putting Many Small Studies Together


Putting Many Small Studies Together

This is the last of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how lots of small studies can give better evidence to resolve achievement gaps. To read the first three parts, use these links.

1. Ending a Two-Decade Research Legacy

2. ESSA Evidence Tiers and Potential for Bias

3. Validating Research that Helps Reduce Achievement Gaps

The NCLB-era of the single big study should be giving way to the analysis of the differential impacts for subgroups from multiple studies. This is the information that schools need in order to reduce achievement gaps. Today’s technology landscape is ready for this major shift in the research paradigm. The school shutdowns resulting from the COVID-19 pandemic have demonstrated that the value of edtech products goes beyond just the cost reduction of eliminating expensive print materials. Over the last decade digital learning products have collected usage data which provides rich and systematic evidence of how products are being used and by whom. At the same time, schools have accumulated huge databases of digital records on demographics and achievement history, with public data at a granularity down to the grade-level. Using today’s “big data” analytics, this wealth of information can be put to work for a radical reduction in the cost of showing efficacy.

Fast-turnaround, low-cost research studies will enable hundreds of studies to be conducted, providing information to school decision-makers that answers their questions. Their questions are not just “which program, on average, produces the largest effect?” Their questions are “which program is most likely to work in my district, with my kids and teachers, and with my available resources, and which are most likely to reduce gaps of greatest concern?”

Meta-analysis is a method for combining multiple studies to increase generalizability (Shadish, Cook, & Campbell, 2002). With meta-analysis, we can test for stability of effects across sites and synthesize those results, where warranted, based on specific statistical criteria. While moderator analysis was considered merely exploratory in the NCLB era, with meta-analysis, moderator results from multiple small studies can in combination provide confirmation of a differential impact. Meta-analysis, or other approaches to research synthesis, combined with big data, present new opportunities to move beyond the NCLB-era philosophy that prizes the single big study to prove the efficacy of a program.
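To make the idea of combining moderator results concrete, here is a minimal sketch of one common synthesis approach, fixed-effect inverse-variance pooling, together with Cochran's Q as a simple check on the stability of effects across studies. The four differential effect estimates and their standard errors are hypothetical numbers chosen for illustration, not results from any of our studies, and a real synthesis would likely also consider random-effects models.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of per-study estimates.

    Returns the pooled estimate, its standard error, and Cochran's Q,
    a statistic for heterogeneity of effects across studies.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    q = sum(w * (est - pooled) ** 2 for w, est in zip(weights, estimates))
    return pooled, pooled_se, q


# Hypothetical differential (moderator) effects from four small studies,
# in standard deviation units, each with the same standard error.
differential_effects = [0.10, 0.14, 0.06, 0.12]
standard_errors = [0.08, 0.08, 0.08, 0.08]

pooled, pooled_se, q = pool_fixed_effect(differential_effects, standard_errors)
z = pooled / pooled_se

for est, se in zip(differential_effects, standard_errors):
    print(f"single study: estimate={est:.2f}, z={est / se:.2f}")
print(f"pooled: estimate={pooled:.3f}, se={pooled_se:.3f}, z={z:.2f}, Q={q:.2f}")
```

In this made-up example, no single study's differential effect would clear the conventional 1.96 threshold on its own, but the pooled estimate does, and Q is small relative to its three degrees of freedom, which is consistent with a stable effect across the four sites.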

While addressing WWC and ESSA standards, we caution that a single study in one school district, or even several studies in several school districts, may not provide enough useful information to generalize to other school districts. For research to be the most effective, we need studies in enough districts to represent the full diversity of relevant populations. Studies need to systematically include moderator analysis for an effective way to generalize impact for subgroups.

The definitions provided in ESSA do not address how much information is needed to generalize from a particular study for implementation in other school districts. While we accept that well-designed Tier 2 or 3 studies are necessary to establish an appropriate level of rigor, we do not believe a single study is sufficient to declare a program will be effective across varied populations. We note that the Standards for Excellence in Education Research (SEER), recently adopted by the IES, call for facilitating generalizability.

After almost two decades of exclusive focus on the design of the single study, we need to more effectively address achievement gaps with the specifics that school decision-makers need. Lowering the cost and turnaround time for research studies that break out subgroup results is entirely feasible. With enough studies qualified for meta-analysis, a new wealth of information will be available to educators who want to select the products that will best serve their students. This new order will democratize learning across the country, reducing inequities and raising student achievement in K-12 schools.


Validating Research that Helps Reduce Achievement Gaps

This is the third of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how issues of bias affecting the NCLB-era average impact estimates are not necessarily inherited by the differential subgroup estimates. (Read the first and the second parts of the story here.)

There’s a joke among researchers: “When a study is conducted in Brooklyn and in San Diego the average results should apply to Wichita.”

Following the NCLB-era rules, researchers usually put all their resources for a study into the primary result, providing a report on the average effectiveness of a product or tool across all populations. This leads to disregarding subgroup differences and misses the opportunity to discover that the program studied may work better or worse for certain populations. A particularly strong example is from our own work, where this philosophy led to a misleading conclusion. While the program we studied was celebrated as working on average, it turned out not to help the Black kids. It widened an existing achievement gap. Our point is that differential impacts are not just extras; they should be the essential results of school research.

In many cases we find that a program works well for some students and not so much for others. In all cases the question is whether the program increases or decreases an existing gap. Researchers call this differential impact an interaction between the characteristics of the people in the study and the program, or they call it a moderated impact as in: the program effect is moderated by the characteristic. If the goal is to narrow an achievement gap, the difference between subgroups (for example: English language learners, kids in free lunch programs, or girls versus boys) in the impact provides the most useful information.
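To make the arithmetic concrete, here is a minimal sketch in Python. The scores, the subgroup labels, and the function names are all made up for illustration; they are not results from any study. The sketch also uses raw, unadjusted means, whereas a real analysis would adjust for pretest and other characteristics.

```python
from statistics import mean

# Hypothetical posttest scores, invented for illustration only.
scores = {
    ("ELL", "program"): [52, 55, 58, 51],
    ("ELL", "comparison"): [50, 49, 53, 48],
    ("non-ELL", "program"): [70, 74, 69, 71],
    ("non-ELL", "comparison"): [62, 65, 60, 61],
}

def subgroup_impact(subgroup):
    """Program mean minus comparison mean within one subgroup."""
    program = mean(scores[(subgroup, "program")])
    comparison = mean(scores[(subgroup, "comparison")])
    return program - comparison

def differential_impact(group_a, group_b):
    """Difference between two subgroup impacts (the moderated impact)."""
    return subgroup_impact(group_a) - subgroup_impact(group_b)

impact_ell = subgroup_impact("ELL")          # 54 - 50
impact_non_ell = subgroup_impact("non-ELL")  # 71 - 62
diff = differential_impact("ELL", "non-ELL")

print(f"ELL impact: {impact_ell}")
print(f"non-ELL impact: {impact_non_ell}")
print(f"Differential impact (ELL minus non-ELL): {diff}")
```

In this invented case both subgroups gain, so the average impact looks positive, yet the negative differential impact shows that the program widens the gap between the two subgroups.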

Examining differential impacts across subgroups also turns out to be less subject to the kinds of bias that have concerned NCLB-era researchers. In a recent paper, Andrew Jaciw showed that with matched comparison designs, estimation of the differential effect of a program on contrasting subgroups of individuals can be less susceptible to bias than the researcher’s estimate of the average effect. Moderator effects are less prone to certain forms of selection bias. In this work, he develops the idea that when evaluating differential effects using matched comparison studies that involve cross-site comparisons, standard selection bias is “differenced away” or negated. While a different form of bias may be introduced, he shows empirically that it is considerably smaller. This is a compelling and surprising result and speaks to the importance of making moderator effects a much greater part of impact evaluations. Jaciw finds that the differential effects for subgroups do not necessarily inherit the biases that are found in average effects, which were the core focus of the NCLB era.
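A simplified arithmetic sketch of the “differenced away” idea, in Python, may help. It assumes, purely for illustration, that selection bias is simply added to each subgroup's impact estimate and that the two subgroups are the same size. The numbers and names are hypothetical, and this is not the derivation used in Jaciw's paper.

```python
# Hypothetical true impacts for two subgroups (illustration only).
true_impact = {"A": 0.10, "B": 0.30}

def estimates(bias):
    """Return (average estimate, differential estimate) when an
    additive bias is attached to each subgroup's impact estimate."""
    est = {g: true_impact[g] + bias[g] for g in true_impact}
    average = (est["A"] + est["B"]) / 2  # equal-sized subgroups assumed
    differential = est["B"] - est["A"]
    return average, differential

# Benchmark: no bias at all.
true_avg, true_diff = estimates({"A": 0.0, "B": 0.0})

# Case 1: the same selection bias affects both subgroups.
avg1, diff1 = estimates({"A": 0.15, "B": 0.15})
avg_bias_1 = avg1 - true_avg
diff_bias_1 = diff1 - true_diff

# Case 2: the bias differs somewhat between subgroups.
avg2, diff2 = estimates({"A": 0.15, "B": 0.18})
avg_bias_2 = avg2 - true_avg
diff_bias_2 = diff2 - true_diff

print(f"Case 1: bias in average = {avg_bias_1:.3f}, bias in differential = {diff_bias_1:.3f}")
print(f"Case 2: bias in average = {avg_bias_2:.3f}, bias in differential = {diff_bias_2:.3f}")
```

Under these assumptions, a bias shared by both subgroups drops out of the difference entirely. When the bias differs between subgroups, only that difference in biases remains in the differential estimate, while the average estimate still carries the full bias.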

Therefore, matched comparison studies may be less biased than one might think for certain important quantities. On the other hand, RCTs, which are often believed to be without bias (thus, “gold standard”), may be biased in ways that are often overlooked. For instance, results from RCTs may be limited by selection based on who chooses to participate. The teachers and schools who agree to be part of the RCT might bias results in favor of those more willing to take risks and try new things. In that case, the results wouldn’t generalize to less adventurous teachers and schools.

A general advantage that rapid-cycle evaluations (RCEs), in the form of matched comparison experiments, have over RCTs is that they can be conducted under more true-to-life circumstances. If using existing data, outcomes reflect results from field implementations as they happened in real life. Such RCEs can be performed more quickly and at a lower cost than RCTs. They can be used by school districts that have paid for a pilot implementation of a product and want to know in June whether the program should be expanded in September. The key to this kind of quick turn-around, rapid-cycle evaluation is to use data from the just-completed school year rather than following the NCLB-era habit of identifying schools that have never implemented the program and assigning teachers as users and non-users before implementation begins. Tools, such as the RCE Coach (now the Evidence to Insights Coach) and Evidentally’s Evidence Suite, are being developed to support district-driven as well as developer-driven matched comparisons.

Commercial publishers have also come under criticism for potential bias. The Hechinger Report recently published an article entitled: Ed tech companies promise results, but their claims are often based on shoddy research. Also from the Hechinger Report, Jill Barshay offered a critique entitled The dark side of education research: widespread bias. She cites a working paper that was recently published in a well-regarded journal by a Johns Hopkins team led by Betsy Wolf, who worked with reports of studies within the WWC database (all using either RCTs or matched comparisons). Wolf compared differences in results where some studies were paid for by the program developer and others paid for through independent sources (such as IES grants to research organizations). Wolf’s study found the size of the effect based on source of funding substantially favored developer-sponsored studies. The most likely explanation was that developers are more prone to avoid publishing unfavorable results. This is called a “developer effect.” While we don’t doubt that Wolf found a real difference, the interpretation and importance of the bias can be questioned.

First, while more selective reporting by developers may bias their reporting upward, other biases may lead to smaller effects for independently funded research. Following NCLB-era rules, independently funded researchers must convince school districts to use the new materials or programs to be studied. But many developer-driven studies are conducted where the materials or program being studied is already in use (or being considered for use and established as a good fit for adoption). A lack of inherent interest among the recruited districts might bias the average overall effect size downward.

When a developer is budgeting for an evaluation, working in a district that has already invested in the program and succeeded in its implementation is often the best way to provide the information that other districts need, since it shows not just the outcomes but a case study of an implementation. Results from a district that has chosen a program for a pilot and succeeded in its implementation may not be an unfair bias. While the developer-selected districts with successful pilots may score higher than districts recruited by an independently funded researcher, they are also more likely to have commonalities with districts interested in adopting the program. Recruiting schools with no experience with the program may bias the results to be lower than they should be.

Second, the fact that bias was found in the standard NCLB-era average between the user and non-user groups provides another reason to drop the primacy of the overall average and put our focus on the subgroup moderator analysis, where there may be less bias. Average outcomes across all populations have little information value for school district decision-makers. Moderator effects are what they need if their goal is to reduce rather than enlarge an achievement gap.

We have no reason to assume that the information school decision-makers need has inherited the same biases that have been demonstrated in the use of developer-driven Tier 2 and 3 studies in evaluation of programs. We do see that the NCLB-era habit of ignoring subgroup differences reinforces the status quo and hides achievement gaps in K-12 schools.

In the next and final portion of this four-part blog series, we advocate replacing the NCLB-era single study with meta-analyses of many studies where the focus is on the moderators.


ESSA’s Evidence Tiers and Potential for Bias

This is the second of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we explain how ESSA introduced flexibility and how NCLB-era habits have raised issues about bias. (Read the first one here.)

The tiers of evidence defined in the Every Student Succeeds Act (ESSA) give schools and researchers greater flexibility, but are not without controversy. Flexibility creates the opportunity for biased results. The ESSA law, for example, states that studies must statistically control for “selection bias”, recognizing that teachers who “select” to use a program may have other characteristics that give them an advantage and the results for those teachers could be biased upward. As we trace the problem of bias it is useful to go back to the interpretation of ESSA that originated with the NCLB-era approach to research.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based educational products that automatically report usage data, it is important to clarify both ESSA’s useful advances and how the four tiers fail to address a critical scientific concept needed for schools to make use of research.

We’ve written elsewhere how the ESSA tiers of evidence form a developmental scale. The four tiers give educators as well as developers of educational materials and products an easier way to start examining effectiveness without making the commitment to the type of scientifically-based research that NCLB once required.

We think of the four tiers of evidence defined in ESSA as a pyramid as shown in this figure.

[Figure: ESSA levels of evidence pyramid]

  1. RCT. At the apex is Tier 1, defined by ESSA as a randomized control trial (RCT), considered the gold standard in the NCLB era.
  2. Matched Comparison or “quasi-experiments”. With Tier 2 the WWC also allowed for less rigorous experimental research designs, such as matched comparisons or quasi-experiments (QE) where schools, teachers, and students (experimental units) independently chose to engage in the program. QEs are permitted but accepted “with reservations” because without random assignment there is the possibility of “selection bias.” For example, teachers who do well at preparing kids for tests might be more likely to participate in a new program than teachers who don’t excel at test preparation. With an RCT we can expect that such positive traits are equally distributed in the experiment between users and non-users.
  3. Correlational. Tier 3 is an important and useful addition to evidence, as a weaker but readily achieved method once the developer has a product running in schools. At that point, they have an opportunity to see if critical elements of the program correlate with outcomes of interest. This provides promising evidence, which is useful for both improving the product and giving the schools some indication that it is helping. This evidence suggests that it might be worthwhile to follow up with a Tier 2 study for more definitive results.
  4. Rationale. The base level or Tier 4 is the expectation that any product should have a rationale based on learning science for why it is likely to work. Schools will want this basic rationale for why a program should work before trying it out. Our colleagues at Digital Promise have announced a service in which developers are certified as meeting Tier 4 standards.

Each subsequent tier of evidence (from number 4 to 1) improves what’s considered the “rigor” of the research design. It is important to understand that the hierarchy has nothing to do with whether the results can be generalized from the setting of the study to the district where the decision-maker resides.

While the NCLB-era focus on strong design puts emphasis on the Tier 1 RCT, we see Tiers 2 and 3 as an opportunity for lower cost and faster-turn-around “rapid-cycle evaluations” (RCEs). Tier 1 RCTs have given education research a well-deserved reputation as slow and expensive. It can take one to two years to complete an RCT, with additional time needed for data collection, analysis, and reporting. This extensive work also includes recruiting districts that are willing to participate in the RCT and often puts the cost of the study in the millions of dollars. We have conducted dozens of RCTs following the NCLB-era rules, but advocate less expensive studies in order to get the volume of evidence schools need. In contrast to an RCT, an RCE that uses existing data from a school system can be both faster and far less expensive.

There is some controversy about whether schools should use lower-tier evidence, which might be subject to “selection bias.” Randomized control trials are protected from selection bias since users and non-users are assigned randomly, whether they like it or not. It is well known, and has been recently pointed out by Robert Slavin, that using a matched comparison, a study where teachers chose to participate in the pilot of a product, can result in unmeasured variables, technically “confounders,” that affect outcomes. These variables are associated with the qualities that motivate a teacher to pursue pilot studies and their desire to excel in teaching. The comparison group may lack these characteristics that help the self-selected program users succeed. Studies of Tiers 2 and 3 will always have, by definition, unmeasured variables that may act as confounders.

While this is obviously a concern, there are ways that researchers can statistically control for important characteristics associated with selection to use a program. For example, the amount of a teacher’s motivation to use edtech products can be controlled by collecting information from the prior year on the amount of usage by the teacher and students of a full set of products. Past studies looking at the conditions under which there is correspondence between results of RCTs and matched comparison studies that evaluate the impact of a given program have established that it is exactly “focal” variables such as motivation that are influential confounders. Controlling for a teacher’s demonstrated motivation and students’ incoming achievement may go very far in adjusting away bias. We suggest this in a design memo for a study now being undertaken. This statistical control meets the ESSA requirement for Tiers 2 and 3.

We have a more radical proposal for controlling all kinds of bias that we address in the next posting in this series.


Agile Assessment and the Impact of Formative Testing on Student Achievement in Algebra

Empirical Education contracted with Jefferson Education Accelerator (JEA) to conduct a study on the effectiveness of formative testing for improving student achievement in Algebra. We partnered with a large urban school district in the northeast U.S. to evaluate their use of Agile Assessment. Developed by experts at the Charles A. Dana Center at the University of Texas and education company Agile Mind, Agile Assessment is a flexible system for developing, administering, and analyzing student assessments that are aligned by standard, reading level, and level of difficulty. The district used benchmark Agile Assessments in the fall, winter, and spring to assess student performance in Algebra along with a curriculum it had chosen independently of the assessments.

We conducted a quasi-experimental comparison group study using data from the 2016-17 school year and examined the impact of Agile Assessment usage on student achievement for roughly 1,000 students using the state standardized assessment in Algebra.

There were three main findings from the study:

  1. Algebra scores for students who used Agile Assessment were better than scores of comparison students. The result had an effect size of .30 (p = .01), which corresponds to a 12-percentile point gain, adjusting for differences in student demographics and pretest between treatment and comparison students.
  2. The positive impact of Agile Assessment generalized across many student subgroups, including Hispanic students, economically disadvantaged students and special education students.
  3. Outcomes on the state Algebra assessment were positively associated with the average score on the Agile Assessment benchmark tests. That said, adding the average score on Agile Assessment benchmark tests to the linear model increased its predictive power by a small amount.
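
The percentile-point figure in the first finding can be reproduced with a standard conversion. This is a sketch under our own assumptions, namely that scores are normally distributed and that the gain is measured for a student starting at the 50th percentile. The full report may have computed it differently, and the function names here are illustrative.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def percentile_point_gain(effect_size):
    """Percentile-point gain for a median comparison student who
    moves up by `effect_size` standard deviations."""
    return 100.0 * (normal_cdf(effect_size) - 0.5)

gain = percentile_point_gain(0.30)
print(f"Effect size 0.30 -> about {round(gain)} percentile points")
```

Under these assumptions, a student at the 50th percentile of the comparison group would be expected to land at roughly the 62nd percentile.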

These findings provide valuable evidence in favor of formative testing for the district and other stakeholders. Given disruptions in the current public school paradigm, more frequent formative assessment could give educators the visibility they need to personalize instruction and ultimately improve student outcomes. You can read the full research report here.