Blog Posts and News Stories

Report of the Evaluation of iRAISE Released

Empirical Education Inc. has completed its evaluation (read the report here) of an online professional development program for Reading Apprenticeship. WestEd’s Strategic Literacy Initiative (SLI) was awarded a development grant under the Investing in Innovation (i3) program in 2012. iRAISE (internet-based Reading Apprenticeship Improving Science Education) is an online professional development program for high school science teachers. iRAISE trained more than 100 teachers in Michigan and Pennsylvania over the three years of the grant. Empirical’s randomized control trial measured the impact of the program on students with special attention to differences in their incoming reading achievement levels.

The goal of iRAISE was to improve student achievement by training teachers in the use of Reading Apprenticeship, an instructional framework that describes the classroom in four interacting dimensions of learning: social, personal, cognitive, and knowledge-building. The inquiry-based professional development (PD) model included a week-long Foundations training in the summer; monthly synchronous group sessions and smaller personal learning communities; and asynchronous discussion groups designed to change teachers’ understanding of their role in adolescent literacy development and to build capacity for literacy instruction in the academic disciplines. iRAISE adapted an earlier face-to-face version of Reading Apprenticeship professional development, which was studied under an earlier i3 grant, Reading Apprenticeship Improving Secondary Education (RAISE), into a completely online course, creating a flexible, accessible platform.

To evaluate iRAISE, Empirical Education conducted an experiment in which 82 teachers across 27 schools were randomly assigned either to receive the iRAISE professional development during the 2014-15 school year or to continue with business as usual and receive the program one year later. Data collection included monthly teacher surveys that measured teachers’ use of several classroom instructional practices and a spring administration of an online literacy assessment, developed by Educational Testing Service, to measure student achievement in literacy. We found significant positive impacts of iRAISE on several of the classroom practice outcomes, including teachers’ provision of explicit instruction on comprehension strategies, their use of metacognitive inquiry strategies, and their levels of confidence in literacy instruction. These results are consistent with the prior RAISE research study and constitute an important replication of those findings, as they substantiate the success of SLI’s development of a more accessible online version of their teacher PD. After one year of implementation, we did not find an overall effect of the program on student literacy achievement. However, we did find that levels of incoming reading achievement moderated the impact of iRAISE on general reading literacy, such that lower-scoring students benefited more. The success of iRAISE in adapting immersive, high-quality professional development to an online platform is promising for the field.
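For readers curious about what a moderation analysis of this kind can look like, here is a minimal sketch (not the study’s actual analysis code) of testing whether incoming reading achievement moderates a program’s impact, using a mixed model with a random intercept for the unit of random assignment. The file and column names are hypothetical.

```python
# Hypothetical sketch of a treatment-by-pretest moderation analysis.
# Column names (literacy_post, treatment, reading_pre, teacher_id) are
# assumptions for illustration, not the study's actual variables.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_outcomes.csv")  # hypothetical student-level file

# Random intercept for teacher, the unit of random assignment. A negative
# treatment:reading_pre coefficient would be consistent with lower-scoring
# students benefiting more from the program.
model = smf.mixedlm("literacy_post ~ treatment * reading_pre",
                    data=df, groups=df["teacher_id"])
result = model.fit()
print(result.summary())
```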

You can access the report and research summary from the study using the links below.
iRAISE research report
iRAISE research summary

2016-07-01

Five-year evaluation of Reading Apprenticeship i3 implementation reported at SREE

Empirical Education has released two research reports on the scale-up and impact of Reading Apprenticeship, as implemented under one of the first cohorts of Investing in Innovation (i3) grants. The Reading Apprenticeship Improving Secondary Education (RAISE) project reached approximately 2,800 teachers in five states with a program providing teacher professional development in content literacy in three disciplines: science, history, and English language arts. RAISE supported Empirical Education and our partner, IMPAQ International, in evaluating the innovation through both a randomized control trial encompassing 42 schools and a systematic study of the scale-up across 239 schools. The RCT found a significant impact on student achievement in science classes, consistent with prior studies. The mean impact across subjects, while positive, did not reach the .05 level of significance. The scale-up study found evidence that the strategy of building cross-disciplinary teacher teams within the school is associated with growth and sustainability of the program. Both components of the evaluation were presented at the annual conference of the Society for Research on Educational Effectiveness, March 6-8, 2016, in Washington, DC. Cheri Fancsali (formerly of IMPAQ, now at the Research Alliance for NYC Schools) presented results of the RCT. Denis Newman (Empirical) presented a comparison of RAISE as instantiated in the RCT and scale-up contexts.

You can access the reports and research summaries from the studies using the links below.
RAISE RCT research report
RAISE RCT research summary
RAISE Scale-up research report
RAISE Scale-up research summary

2016-03-09

SREE Spring 2016 Conference Presentations

We are excited to be presenting two topics at the annual Spring Conference of the Society for Research on Educational Effectiveness (SREE) next week. Our first presentation addresses the problem of using multiple pieces of evidence to support decisions. Our second presentation compares the context of an RCT with that of schools implementing the same program without those constraints. If you’re at SREE, we hope to run into you, either at one of these presentations (details below) or at one of yours.

Friday, March 4, 2016 from 3:30 - 5PM
Roosevelt (“TR”) - Ritz-Carlton Hotel, Ballroom Level

6E. Evaluating Educational Policies and Programs
Evidence-Based Decision-Making and Continuous Improvement

Chair: Robin Wisniewski, RTI International

Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple RCTs of a Program to Support Decision-Making
Andrew P. Jaciw, Denis Newman, Val Lazarev, & Boya Ma, Empirical Education



Saturday, March 5, 2016 from 10AM - 12PM
Culpeper - Fairmont Hotel, Ballroom Level

Session 8F: Evaluating Educational Policies and Programs & International Perspectives on Educational Effectiveness
The Challenge of Scale: Evidence from Charters, Vouchers, and i3

Chair: Ash Vasudeva, Bill & Melinda Gates Foundation

Comparing a Program Implemented under the Constraints of an RCT and in the Wild
Denis Newman, Valeriy Lazarev, & Jenna Zacamy, Empirical Education

2016-02-26

Understanding Logic Models Workshop Series

On July 17, Empirical Education facilitated the first of two workshops for practitioners in New Mexico on the development of program logic models, one of the first steps in developing a research agenda. The workshop, entitled “Identifying Essential Logic Model Components, Definitions, and Formats”, introduced the general concepts, purposes, and uses of program logic models to members of the Regional Education Lab (REL) Southwest’s New Mexico Achievement Gap Research Alliance. Throughout the workshop, participants collaborated with facilitators to build a logic model for a program or policy that participants are working on or that is of interest.

Empirical Education is part of the REL Southwest team, which assists Arkansas, Louisiana, New Mexico, Oklahoma, and Texas in using data and research evidence to address high-priority regional needs, including charter school effectiveness, early childhood education, Hispanic achievement in STEM, rural school performance, and closing the achievement gap, through six research alliances. The logic model workshops aim to strengthen the technical capacity of New Mexico Achievement Gap Research Alliance members to understand and visually represent their programs’ theories of change, identify key program components and outcomes, and use logic models to develop research questions. Both workshops are being held in Albuquerque, New Mexico.

2014-06-17

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill & Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year, and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine, our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming nature of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers; their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to a different goal: producing a single score of teaching quality for every teacher that can be used administratively, for example, in personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services, such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching, such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see the earlier story on our Validation Engine), but school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at changes in achievement related to the local implementation of the evaluation system. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.
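As a concrete illustration of the first suggestion, a district analyst could compute the correlation between each rubric component and teachers’ value-added scores with a few lines of code. This is a minimal sketch; the file and column names are hypothetical.

```python
# Hypothetical sketch: correlate observation rubric components with VAM scores.
import pandas as pd

teachers = pd.read_csv("teacher_measures.csv")  # assumed: one row per teacher

# Assumed rubric component names; replace with the district's own rubric.
rubric_components = ["questioning", "feedback", "classroom_climate"]
correlations = teachers[rubric_components].corrwith(teachers["vam_score"])
print(correlations.sort_values(ascending=False))
```

Components that show little or no association with achievement gains would be candidates for streamlining.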

2012-10-31

Need for Product Evaluations Continues to Grow

There is a growing need for evidence of the effectiveness of products and services being sold to schools. A new release of SIIA’s product evaluation guidelines is now available at the Selling to Schools website (with continued free access to SIIA members), to help guide publishers in measuring the effectiveness of the tools they are selling to schools.

It’s been almost a decade since NCLB made its call for “scientifically-based research,” but the calls for research haven’t faded away. This is because resources available to schools have diminished over that time, heightening the importance of cost-benefit trade-offs in spending.

NCLB has focused attention on test score achievement, and this metric is becoming more pervasive; e.g., through a tie to teacher evaluation and through linkages to dropout risk. While NCLB fostered a compliance mentality—product specs had to have a check mark next to SBR—the need to ensure that funds are not wasted is now leading to a greater interest in research results. Decision-makers are now very interested in whether specific products will be effective, or how well they have been working, in their districts.

Fortunately, the data available for evaluations of all kinds is getting better and easier to access. The U.S. Department of Education has poured hundreds of millions of dollars into state data systems. These investments make data available to states and drive the cleaning and standardizing of data from districts. At the same time, districts continue to invest in data systems and warehouses. While still not a trivial task, getting the data needed to determine whether an investment paid off—in terms of increased student achievement or attendance—has become much easier for school district researchers over the last decade.

The reauthorization of ESEA (i.e., NCLB) is maintaining the pressure to evaluate education products. We are still a long way from the draft reauthorization introduced in Congress becoming a law, but the initial indications are quite favorable to the continued production of product effectiveness evidence. The language has changed somewhat. Look for the phrase “evidence based”. Along with the term “scientifically-valid”, this new language is actually more sophisticated and potentially more effective than the old SBR neologism. Bob Slavin, one of the reviewers of the SIIA guidelines, says in his Ed Week blog that “This is not the squishy ‘based on scientifically-based evidence’ of NCLB. This is the real McCoy.” It is notable that the definition of “evidence-based” goes beyond just setting rules for the design of research, such as the SBR focus on the single dimension of “internal validity” for which randomization gets the top rating. It now asks how generalizable the research is or its “external validity”; i.e., does it have any relevance for decision-makers?

One of the important goals of the SIIA guidelines for product effectiveness research is to improve the credibility of publisher-sponsored research. It is important that educators see it as more than just “market research” producing biased results. In this era of reduced budgets, schools need to have tangible evidence of the value of products they buy. By following the SIIA’s guidelines, publishers will find it easier to achieve that credibility.

2011-11-12

Comment on the NY Times: In Classroom of Future, Stagnant Scores

The New York Times is running a series of front-page articles on “Grading the Digital School.” The first one ran Labor Day weekend and raised the question as to whether there’s any evidence that would persuade a school board or community to allocate extra funds for technology. With the demise of the Enhancing Education Through Technology (EETT) program, federal funds dedicated to technology will no longer be flowing into states and districts. Technology will have to be measured against any other discretionary purchase. The resulting internal debates within schools and their communities about the expense vs. value of technology promise to have interesting implications and are worth following closely.

The first article by Matt Richtel revisits a debate that has been going on for decades between those who see technology as the key to “21st Century learning” and those who point to the dearth of evidence that technology makes any measurable difference to learning. It’s time to try to reframe this discussion in terms of what can be measured. And in considering what to measure, and in honor of Labor Day, we raise a question that is often ignored: what role do teachers play in generating the measurable value of technology?

Let’s start with the most common argument in favor of technology, even in the absence of test score gains. The idea is that technology teaches skills “needed in a modern economy,” and these are not measured by the test scores used by state and federal accountability systems. Karen Cator, director of the U.S. Department of Education’s Office of Educational Technology, is quoted as saying (in reference to the lack of improvement in test scores), “…look at all the other things students are doing: learning to use the Internet to research, learning to organize their work, learning to use professional writing tools, learning to collaborate with others.” Presumably, none of these things directly impacts test scores. The problem with this perennial argument is that many other things that schools keep track of should provide indicators of improvement. If, as a result of technology, students are more excited about learning or more engaged in collaborating, we could look for an improvement in attendance, a decrease in dropouts, or students signing up for more challenging courses.

Information on student behavioral indicators is becoming easier to obtain since the standardization of state data systems. There are some basic study designs that use comparisons among students within the district or between those in the district and those elsewhere in the state. This approach uses statistical modeling to identify trends and control for demographic differences, but is not beyond the capabilities of many school district research departments[1] or the resources available to the technology vendors. (Empirical has conducted research for many of the major technology providers, often focusing on results for a single district interested in obtaining evidence to support local decisions.) Using behavioral or other indicators, a district such as that in the Times article can answer its own questions. Data from the technology systems themselves can be used to identify users and non-users and to confirm the extent of usage and implementation. It is also valuable to examine whether some students (those in most need or those already doing okay) or some teachers (veterans or novices) receive greater benefit from the technology. This information may help the district focus resources where they do the most good.
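To make the kind of within-district comparison described above more concrete, here is a rough sketch of a regression comparing users and non-users of a technology on a behavioral indicator while controlling for prior achievement and demographics. The variable names are hypothetical, and a real analysis would also need to address how students came to be users in the first place.

```python
# Hypothetical sketch of a within-district users vs. non-users comparison.
import pandas as pd
import statsmodels.formula.api as smf

students = pd.read_csv("district_students.csv")  # assumed student-level extract

# tech_user is assumed to come from the technology system's own usage logs
# (coded 0/1); the covariates adjust for prior achievement and demographics.
model = smf.ols(
    "attendance_rate ~ tech_user + prior_score + C(grade) + C(frl_status)",
    data=students,
).fit()
print(model.summary())
```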

A final thought about where to look for impacts of technologies comes from a graph of the school district’s budget. While spending on both technology and salaries has declined over the last three years, spending on salaries is still about 25 times as great as spending on technology. Any discussion of where to find an impact of technology must consider labor costs, which are the district’s primary investment. We might ask whether a small investment in technology would allow the district to reduce the number of teachers by, for example, allowing a small increase in the number of students each teacher can productively handle. Alternatively, we might ask whether technology can make a teacher more effective, by whatever measures of effective teaching the district chooses to use, with their current students. We might also ask whether technologies result in keeping young teachers on the job longer or in encouraging them to take on more challenging assignments.

It may be a mistake to look for a direct impact of technology on test scores (aside from technologies aimed specifically at that goal), but it is also a mistake to assume the impact is, in principle, not measurable. We need a clear picture of how various technologies are expected to work and where we can look for the direct and indirect effects. An important role of technology in the modern economy is providing people with actionable evidence. It would be ironic if education technology was inherently opaque to educational decision makers.

[1] Or, we would hope, the New York Times. Sadly, the article provides a graph of trends in math and reading for the district highlighted in the story compared to trends for the state. The graphic is meant to show that the district is doing worse than the state average. But the article never suggests that we should consider the population of the particular district and whether it is doing better or worse than one would expect, controlling for demographics, available resources, and other characteristics.

2011-09-12

New RFP calls for Building Regional Research Capacity

The US Department of Education (ED) has just released the eagerly anticipated RFP for the next round of the Regional Education Laboratories (RELs). This RFP contains some very interesting departures from how the RELs have been working, which may be of interest especially to state and local educators.

For those unfamiliar with federal government organizations, the RELs are part of the National Center for Education Evaluation and Regional Assistance (NCEE), which is within the Institute of Education Sciences (IES), part of ED. The country is divided into ten regions, each one served by a REL—so the RFP announced today is really a call for proposals in ten different competitions. The RELs have been in existence for decades, but their mission has evolved over time. For example, the previous RFP (about six years ago) put a strong emphasis on rigorous research, particularly randomized control trials (RCTs), leading the contractors in each of the ten regions to greatly expand their capacity, in part by bringing in subcontractors with the requisite technical skills. (Empirical conducted or assisted with RCTs in four of the ten regions.) The new RFP changes the focus in two essential ways.

First, one of the major tasks is building capacity for research among practitioners. Educators at the state and local levels told ED that they needed more capacity to make use of the longitudinal data systems that ED has invested in through grants to the states. It is one thing to build the data systems. It is another thing to use the data to generate evidence that can inform decisions about policies and programs. Last month at the conference of the Society for Research on Educational Effectiveness, Rebecca Maynard, Commissioner of NCEE, talked about building a “culture of experimentation” among practitioners and building their capacity for simpler experiments that don’t take so long and are not as expensive as those NCEE has typically contracted for. Her point was that the resulting evidence is more likely to be used if the practitioners are “up close and immediate.”

The second idea found in the RFP for the RELs is that each regional lab should work through “alliances” of state and local agencies. These alliances would cross state boundaries (at least within the region) and would provide an important part of the REL’s research agenda. The idea goes beyond having an advisory panel for the REL that requests answers to questions. The alliances are also expected to build their own capacity to answer these questions using rigorous research methods but applying them cost-effectively and opportunistically. The capacity of the alliances should outlive the support provided by the RELs. If your organization is part of an existing alliance and would like to get better at using and conducting research, there are teams being formed to go after the REL contracts that would be happy to hear from you. (If you’re not sure who to call, let us know and we’ll put you in touch with an appropriate team.)

2011-05-11

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research in Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including over many locales, where the results are averaged) provides a useful starting point, but not “proof” that the program will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, and how the working hypothesis corroborated through the experiment did not transfer to a different context. We also see the applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale up interventions, but where the grants included funding for research under new conditions, where implementation may head in unanticipated directions, leading to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. This should not be seen as a threat to generalizability because an estimate of average impact is not robust across conditions. We should spend some time looking at the ways that the intervention interacts differently with local characteristics, in order to determine which factors account for heterogeneity in the impact and which ones do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge, as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.
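In practice, this kind of exploration can be as simple as screening a set of candidate local characteristics as moderators, one at a time, and flagging the interactions worth following up. The sketch below is illustrative only; the data file and variable names are hypothetical, and any pattern it surfaces would be a working hypothesis in Cronbach’s sense, not a conclusion.

```python
# Hypothetical sketch: screen candidate moderators of a treatment effect.
import pandas as pd
import statsmodels.formula.api as smf

sites = pd.read_csv("multi_site_study.csv")  # assumed analysis file
candidate_moderators = ["urbanicity", "prior_achievement", "pct_frl"]  # assumed

for moderator in candidate_moderators:
    # Fit a separate treatment-by-moderator interaction model for each candidate.
    fit = smf.ols(f"outcome ~ treatment * {moderator}", data=sites).fit()
    interactions = [term for term in fit.params.index if ":" in term]
    print(moderator, fit.params[interactions].round(3).to_dict())
```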

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. The call for quicker turn-around in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116-127.

2011-03-21