The U.S. Department of Education (ED) recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program, where a proposal must show some evidence that the proposed innovation has a chance of working and scaling, and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. Each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules had some weaknesses, which were pointed out in the public comments available online. We hope ED heeds these suggestions.
Mark Schneiderman, representing the Software and Information Industry Association (SIIA), recommends that the outcomes used in effectiveness studies not be limited to achievement scores.
SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.
He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.
SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.
This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment, one cannot anticipate the size of the impact in any other context. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.
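To make the comparison-group point concrete, here is a minimal sketch in Python. All of the numbers and district names are hypothetical, invented purely for illustration: we assume the same product yields the same mean outcome everywhere, and only the strength of the existing program it is compared against differs.

```python
# Hypothetical illustration: the measured impact of a product depends on
# what it is compared against, not only on the product itself.

# Assumed mean outcome for students using the new product (same everywhere).
product_mean = 70.0

# Assumed mean outcomes under each district's existing program.
comparison_means = {
    "district_with_weak_program": 60.0,
    "district_with_strong_program": 68.0,
}

# Measured impact in each district = product outcome minus comparison outcome.
impacts = {
    district: product_mean - comparison_mean
    for district, comparison_mean in comparison_means.items()
}

# A multi-site "on average" result pools across these contexts.
average_impact = sum(impacts.values()) / len(impacts)

for district, impact in impacts.items():
    print(f"{district}: impact = {impact}")
print(f"average across sites: {average_impact}")
```

In this sketch the pooled average sits between the two local results and matches neither, which is why an "on average" finding is only a rough guide for a decision-maker in any one district.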
Jon Baron, writing for the Coalition for Evidence-Based Policy, expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that
“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”
He cites examples where researchers have reported statistically significant results that were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study: policy makers must also consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.
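The gap between "statistically significant" and "substantial" is easy to see with a little arithmetic. The sketch below is ours, not drawn from the comments: the function name, the test-score numbers, and the sample size are all hypothetical, and it assumes a simple two-group comparison with equal group sizes and a known common standard deviation (a z-test with a normal approximation).

```python
import math


def effect_size_and_p_value(mean_treatment, mean_control, sd, n_per_group):
    """Return (standardized effect size, two-sided p-value) for a difference in means.

    Assumes equal group sizes and a known, common standard deviation,
    so a normal (z) approximation applies.
    """
    difference = mean_treatment - mean_control
    effect_size = difference / sd                      # difference in SD units
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = difference / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal tail
    return effect_size, p_value


# Hypothetical study: a half-point gain on a test with SD 50,
# measured on 100,000 students per group.
effect_size, p_value = effect_size_and_p_value(
    mean_treatment=500.5, mean_control=500.0, sd=50.0, n_per_group=100_000
)
print(f"effect size = {effect_size:.3f} SD, p = {p_value:.4f}")
```

With these made-up numbers the difference clears the conventional p < 0.05 bar (p is roughly 0.025), yet the effect is one hundredth of a standard deviation. A very large sample can make almost any difference statistically significant, which is why a separate judgment about whether the effect is substantial and important is needed.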
Addressing the substance and importance of the results gets us appropriately into questions of external validity and leads to questions about subgroup impact. For example, an innovation may have a positive impact “on average” and work well for high-scoring students but provide no value for low-scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.
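A small worked example shows how a positive average can hide a widening gap. The subgroup sizes and impacts below are hypothetical, chosen only to mirror the scenario just described.

```python
# Hypothetical subgroup results: (number of students, impact in score points).
subgroups = {
    "high_scoring": (500, 6.0),
    "low_scoring": (500, 0.0),
}

# Overall impact is the student-weighted average of the subgroup impacts.
total_students = sum(n for n, _ in subgroups.values())
average_impact = sum(n * impact for n, impact in subgroups.values()) / total_students

# Change in the achievement gap = impact on the high-scoring group
# minus impact on the low-scoring group.
gap_change = subgroups["high_scoring"][1] - subgroups["low_scoring"][1]

print(f"average impact: {average_impact} points")
print(f"change in gap:  {gap_change} points")
```

Here the innovation shows a positive average impact of 3 points, but every bit of that gain goes to the high-scoring group, so the gap grows by 6 points. An evaluation that reports only the average would miss this.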