blog posts and news stories

Five-year evaluation of Reading Apprenticeship i3 implementation reported at SREE

Empirical Education has released two research reports on the scale-up and impact of Reading Apprenticeship, as implemented under one of the first cohorts of Investing in Innovation (i3) grants. The Reading Apprenticeship Improving Secondary Education (RAISE) project reached approximately 2,800 teachers in five states with a program providing teacher professional development in content literacy in three disciplines: science, history, and English language arts. RAISE supported Empirical Education and our partner, IMPAQ International, in evaluating the innovation through both a randomized control trial encompassing 42 schools and a systematic study of the scale-up of 239 schools. The RCT found significant impact on student achievement in science classes consistent with prior studies. Mean impact across subjects, while positive, did not reach the .05 level of significance. The scale-up study found evidence that the strategy of building cross-disciplinary teacher teams within the school is associated with growth and sustainability of the program. Both sides of the evaluation were presented at the annual conference of the Society for Research on Educational Effectiveness, March 6-8, 2016 in Washington DC. Cheri Fancsali (formerly of IMPAQ, now at Research Alliance for NYC Schools) presented results of the RCT. Denis Newman (Empirical) presented a comparison of RAISE as instantiated in the RCT and scale-up contexts.

You can access the reports and research summaries from the studies using the links below.
RAISE RCT research report
RAISE RCT research summary
RAISE Scale-up research report
RAISE Scale-up research summary

2016-03-09

SREE Spring 2016 Conference Presentations

We are excited to be presenting two topics at the annual Spring Conference of The Society for Research on Educational Effectiveness (SREE) next week. Our first presentation addresses the problem of using multiple pieces of evidence to support decisions. Our second presentation compares the context of an RCT with schools implementing the same program without those constraints. If you’re at SREE, we hope to run into you, either at one of these presentations (details below) or at one of yours.

Friday, March 4, 2016 from 3:30 - 5PM
Roosevelt (“TR”) - Ritz-Carlton Hotel, Ballroom Level

6E. Evaluating Educational Policies and Programs
Evidence-Based Decision-Making and Continuous Improvement

Chair: Robin Wisniewski, RTI International

Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple RCTs of a Program to Support Decision-Making
Andrew P. Jaciw, Denis Newman, Val Lazarev, & Boya Ma, Empirical Education



Saturday, March 5, 2016 from 10AM - 12PM
Culpeper - Fairmont Hotel, Ballroom Level

Session 8F: Evaluating Educational Policies and Programs & International Perspectives on Educational Effectiveness
The Challenge of Scale: Evidence from Charters, Vouchers, and i3

Chair: Ash Vasudeva, Bill & Melinda Gates Foundation

Comparing a Program Implemented under the Constraints of an RCT and in the Wild
Denis Newman, Valeriy Lazarev, & Jenna Zacamy, Empirical Education

2016-02-26

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research in Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and over-report large, but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge to conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions, from bias in results that John Ioannidis considered. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind—an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard, it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: what are the attributes of participants and settings across which we expect effects to vary and why? Under which conditions and how do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

Empirical Education Presents Initial Results from i3 Validation Grant Evaluation

Our director of research operations, Jenna Zacamy, joined Cheri Fancsali from IMPAQ International and Cyndy Greenleaf from the Strategic Literacy Initiative (SLI) at WestEd at the Literacy Research Association (LRA) conference in Dallas, TX on December 4. Together, they conducted a symposium, which was the first formal presentation of findings from the Investing in Innovation (i3) Validation grant, Reading Apprenticeship Improving Secondary Education (RAISE). WestEd won the grant in 2010 with Empirical Education and IMPAQ serving as the evaluators. There are two ongoing evaluations: the first includes a randomized control trial (RCT) of over 40 schools in California and Pennsylvania investigating the impact of Reading Apprenticeship on teacher instructional practices and student achievement; the second is a formative evaluation spanning four states and 150+ schools investigating how the school systems build capacity to implement and disseminate Reading Apprenticeship and sustain these efforts. The symposium’s discussant, P. David Pearson (UC Berkeley), provided praise of the design and effort of both studies stating that he has “never seen such thoughtful and extensive evaluations.” Preliminary findings from the RCT show that Reading Apprenticeship teachers provide students more opportunities to practice metacognitive strategies and foster and support more student collaboration opportunities. Findings from the second year of the formative evaluation suggest high levels of buy-in and commitment from both school administrators and teachers, but also identify competing initiatives and priorities as a primary challenge to sustainability. Initial findings of our five-year, multi-state study of RAISE are promising, but reflect the real-world complexity of scaling up and evaluating literacy initiatives across several contexts. Final results from both studies will be available in 2015.

View the information presented at LRA here and here.

2013-12-19

Empirical Presents about Aspire Public School’s t3 System at AEA 2013

Empirical Education presented at the annual conference of the American Evaluation Association (AEA) in Washington, DC. Our newest research manager, Kristen Koue, along with our chief scientist, Andrew Jaciw reflected on striking the right balance between conducting a rigorous randomized control trial that meets i3 grant parameters, while also conducting an implementation evaluation that provides useful formative feedback to the Aspire population.

2013-10-15

Study Shows a “Singapore Math” Curriculum Can Improve Student Problem Solving Skills

A study of HMH Math in Focus (MIF) released today by research firm Empirical Education Inc. demonstrates a positive impact of the curriculum on Clark County School District elementary students’ math problem solving skills. The 2011-2012 study was contracted by the publisher, which left the design, conduct, and reporting to Empirical. MIF provides elementary math instruction based on the pedagogical approach used in Singapore. The MIF approach to instruction is designed to support conceptual understanding, and is said to be closely aligned with the Common Core State Standards (CCSS), which focuses more on in-depth learning than previous math standards.

Empirical found an increase in math problem solving among students taught with HMH Math in Focus compared to their peers. The Clark County School District teachers also reported an increase in their students’ conceptual understanding, as well as an increase in student confidence and engagement while explaining and solving math problems. The study addressed the difference between the CCSS-oriented MIF and the existing Nevada math standards and content. While MIF students performed comparatively better on complex problem solving skills, researchers found that students in the MIF group performed no better than the students in the control group on the measure of math procedures and computation skills. There was also no significant difference between the groups on the state CRT assessment, which has not fully shifted over to the CCSS.

The research used a group randomized control trial to examine the performance of students in grades 3-5 during the 2011-2012 school year. Each grade-level team was randomly assigned to either the treatment group that used MIF or the control group that used the conventional math curriculum. Researchers used three different assessments to capture math achievement contrasting procedural and problem solving skills. Additionally, the research design employed teacher survey data to conduct mediator analyses (correlations between percentage of math standards covered and student math achievement) and assess fidelity of classroom implementation.

You can download the report and research summary from the study using the links below.
Math in Focus research report
Math in Focus research summary

2013-04-01

Empirical Releases Final Report on HMH Fuse™ iPad App

Today Empirical and Houghton Mifflin Harcourt made the following announcement. You can download the report and research summary from the study using the links below.
Fuse research report
Fuse research summary

Study Shows HMH Fuse™ iPad® App Can Dramatically Improve Student Achievement

Strong implementation in Riverside Unified School District associated with nine-point increase in percentile standing

BOSTON – April 10, 2012 – A study of HMH Fuse: Algebra 1 app released today by research firm Empirical Education Inc. identifies implementation as a key factor in the success of mobile technology. The 2010–2011 study was a pilot of a new educational app from global education leader Houghton Mifflin Harcourt (HMH) that re-imagines the conventional textbook to fully deploy interactive features of the mobile device. The HMH Fuse platform encourages the use of personalized lesson plans by combining direct instruction, ongoing support, assessment and intervention in one easy-to-use suite of tools.

Empirical found that the iPad-using students in the four participating districts: Long Beach, Fresno, San Francisco and Riverside Unified School District (Riverside Unified), performed on average as well as their peers using the traditional textbook. However, after examining its own results, Riverside Unified found an increase in test scores among students taught with HMH Fuse compared to their peers. Empirical corroborated these results, finding a statistically significant impact equivalent to a nine-point percentile increase. The Riverside Unified teachers also reported substantially greater usage of the HMH Fuse app both in teaching and by the students in class.

“Education technology does not operate in a vacuum, and the research findings reinforce that with a supportive school culture and strategic implementation, technology can have a significant impact on student achievement,” said Linda Zecher, President and CEO of HMH. “We’re encouraged by the results of the study and the potential of mobile learning to accelerate student achievement and deepen understanding in difficult to teach subjects like algebra.”

Across all districts, the study found a positive effect on student attitudes toward math, and those students with positive attitudes toward math achieved higher scores on the California Standards Test.

The research design was a “gold standard” randomized control trial that examined the performance of eighth-grade students during the 2010-2011 school year. Each teacher’s classes were randomly assigned to either the treatment group that used the HMH Fuse app or the control group that used the conventional print format of the same content.

“The rapid pace of mobile technology’s introduction into K-12 education leaves many educators with important questions about its efficacy especially given their own resources and experience,” said Denis Newman, CEO of Empirical Education. “The results from Riverside highlight the importance of future research on mobile technologies that account for differences in teacher experience and implementation.”

To access the full research report, go to www.empiricaleducation.com. A white paper detailing the implementation and impact of HMH Fuse in Riverside is available on the HMH website.

2012-04-10

The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for this. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual districts, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then some doubt about completeness may be cast on the value of the confirmatory finding. We may not be certain of a moderator effect—for example—but once it comes to light, the value of the average impact can also be considered incomplete or misleading for practical purposes. If it is necessary to conduct an additional experiment to verify a differential subgroup impact, the same experiment may verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizability of the results from even the most stringently designed experiment mean that the results should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work is the critical information to have. For a school system decision maker, the most useful pieces of information are insight into subpopulations that appear to benefit and conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies that have confirmatory results in contexts similar or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment. The practitioner also has to come to a hypothesis about how things work as the basis for action.

2012-04-01

Empirical is participating in recently awarded five-year REL contracts

The Institute of Education Sciences (IES) at the U.S. Department of Education recently announced the recipients of five-year contracts for each of the 10 Regional Education Laboratories (RELs). We are excited to be part of four strong teams of practitioners and researchers that received the awards.

The original request for proposals in May 2011 called for the new RELs to work closely with alliances of state and local education agencies and other practitioner organizations to build local capacity for research. Considering the close ties between this agenda and Empirical’s core mission we joined the proposal efforts and are now part of winning teams in the West (led by WestEd), Northwest (led by Education Northwest), Midwest (led by the American Institutes for Research (AIR)), and Southwest (led by SEDL) The REL Southwest is currently under a stop work order while ED addresses a dispute concerning its review process. Empirical Education’s history in conducting Randomized Control Trials (RCTs) and in providing technical assistance to education agencies provides a strong foundation for the next five years.

2012-02-16

Exploration in the World of Experimental Evaluation

Our 300+ page report makes a good start. But IES, faced with limited time and resources to complete the many experiments being conducted within the Regional Education Lab system, put strict limits on the number of exploratory analyses researchers could conduct. We usually think of exploratory work as questions to follow up on puzzling or unanticipated results. However, in the case of the REL experiments, IES asked researchers to focus on a narrow set of “confirmatory” results and anything else was considered “exploratory,” even if the question was included in the original research design.

The strict IES criteria were based on the principle that when a researcher is using tests of statistical significance, the probability of erroneously concluding that there is an impact when there isn’t one increases with the frequency of the tests. In our evaluation of AMSTI, we limited ourselves to only four such “confirmatory” (i.e., not exploratory) tests of statistical significance. These were used to assess whether there was an effect on student outcomes for math problem-solving and for science, and the amount of time teachers spent on “active learning” practices in math and in science. (Technically, IES considered this two sets of two, since two were the primary student outcomes and two were the intermediate teacher outcomes.) The threshold for significance was made more stringent to keep the probability of falsely concluding that there was a difference for any of the outcomes at 5% (often expressed as p < .05).

While the logic for limiting the number of confirmatory outcomes is based on technical arguments about adjustments for multiple comparisons, the limit on the amount of exploratory work was based more on resource constraints. Researchers are notorious (and we don’t exempt ourselves) for finding more questions in any study than were originally asked. Curiosity-based exploration can indeed go on forever. In the case of our evaluation of AMSTI, however, there were a number of fundamental policy questions that were not answered either by the confirmatory or by the exploratory questions in our report. More research is needed.

Take the confirmatory finding that the program resulted in the equivalent of 28 days of additional math instruction (or technically an impact of 5% of a standard deviation). This is a testament to the hard work and ingenuity of the AMSTI team and the commitment of the school systems. From a state policy perspective, it gives a green light to continuing the initiative’s organic growth. But since all the schools in the experiment applied to join AMSTI, we don’t know what would happen if AMSTI were adopted as the state curriculum requiring schools with less interest to implement it. Our results do not generalize to that situation. Likewise, if another state with different levels of achievement or resources were to consider adopting it, we would say that our study gives good reason to try it but, to quote Lee Cronbach, a methodologist whose ideas increasingly resonate as we translate research into practice: “…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (Cronbach, 1975, p. 125).

The explorations we conducted as part of the AMSTI evaluation did not take the usual form of deeper examinations of interesting or unexpected findings uncovered during the planned evaluation. All the reported explorations were questions posed in the original study plan. They were defined as exploratory either because they were considered of secondary interest, such as the outcome for reading, or because they were not a direct causal result of the randomization, such as the results for subgroups of students defined by different demographic categories. Nevertheless, exploration of such differences is important for understanding how and for whom AMSTI works. The overall effect, averaging across subgroups, may mask differences that are of critical importance for policy

Readers interested in the issue of subgroup differences can refer to Table 6.11. Once differences are found in groups defined in terms of individual student characteristics, our real exploration is just beginning. For example, can the difference be accounted for by other characteristics or combinations of characteristics? Is there something that differentiates the classes or schools that different students attend? Such questions begin to probe additional factors that can potentially be addressed in the program or its implementation. In any case, the report just released is not the “final report.” There is still a lot of work necessary to understand how any program of this sort can continue to be improved.

2012-02-14
Archive