blog posts and news stories

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research on Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails: researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and to over-report large but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge of conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity” (finding different results in different contexts) is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments on the same intervention and contribute to different results. Also, when experiments on the same program are carried out in different contexts, the program is likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions from the kinds of bias in results that John Ioannidis described. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the “hardest science” of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind: an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard; it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing it in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: What are the attributes of participants and settings across which we expect effects to vary and why? Under which conditions and how do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of a program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.
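
To make the link between this variation and precision concrete, here is a minimal sketch. It is not the analysis described above. It assumes a simplified design in which schools are randomized in equal numbers to two arms and each school yields one independent estimate of the gap between two student subgroups. The function name and all input values are hypothetical.

```python
import math


def se_differential_impact(n_schools: int,
                           between_school_var: float,
                           within_school_var: float) -> float:
    """Standard error of the treatment-minus-control difference in the
    average subgroup gap, under the simplifying assumptions stated above.
    """
    if n_schools < 2 or n_schools % 2 != 0:
        raise ValueError("n_schools must be an even number of at least 2")
    schools_per_arm = n_schools // 2
    # Variance of one school's estimated gap: true school-to-school
    # variation in the gap plus sampling error within the school.
    per_school_var = between_school_var + within_school_var
    # Each arm's mean gap has variance per_school_var / schools_per_arm;
    # the difference of two independent arm means doubles that variance.
    return math.sqrt(2.0 * per_school_var / schools_per_arm)


# Hypothetical values, expressed in squared standard-deviation units.
for tau2 in (0.01, 0.05):
    se = se_differential_impact(80, between_school_var=tau2,
                                within_school_var=0.03)
    print(f"between-school variance {tau2:.2f}: SE = {se:.3f}")
```

Holding the number of schools fixed, greater school-to-school variation in the subgroup gap yields a larger standard error, so a larger differential impact is needed before it can be detected.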

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

Importance is Important for Rules of Evidence Proposed for ED Grant Programs

The U.S. Department of Education recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program where the proposal must show there’s some evidence that the proposed innovation has a chance of working and scaling and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. And each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules definitely had some weak points that were pointed out in the public comments, which are available online. We hope ED heeds these suggestions.

Mark Schneiderman, representing the Software and Information Industry Association (SIIA), recommends that outcomes used in effectiveness studies not be limited to achievement scores.

SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.

He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.

SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.

This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment, the size of the impact for any other context cannot be anticipated. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.

Jon Baron, writing for the Coalition for Evidence-Based Policy, expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that

“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”

He cites examples where researchers have reported statistically significant results, which were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study—policy makers must consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.

Addressing the substance and importance of the results gets us appropriately into questions of external validity, and leads us to questions about subgroup impact, where, for example, an innovation has a positive impact “on average” and works well for high scoring students but provides no value for low scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.
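
A small worked example shows how a positive average impact can coexist with a widening gap. The numbers below are entirely hypothetical and are not drawn from any study; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subgroup:
    name: str
    share: float         # fraction of the study sample (hypothetical)
    control_mean: float  # mean score without the innovation (hypothetical)
    impact: float        # treatment-minus-control difference (hypothetical)

    @property
    def treated_mean(self) -> float:
        return self.control_mean + self.impact


def average_impact(groups: list[Subgroup]) -> float:
    """Sample-weighted average of the subgroup impacts."""
    return sum(g.share * g.impact for g in groups)


high = Subgroup("high scoring", share=0.5, control_mean=60.0, impact=4.0)
low = Subgroup("low scoring", share=0.5, control_mean=40.0, impact=0.0)

overall = average_impact([high, low])
gap_without = high.control_mean - low.control_mean
gap_with = high.treated_mean - low.treated_mean

print(f"average impact: {overall:+.1f} points")
print(f"gap without the innovation: {gap_without:.1f} points")
print(f"gap with the innovation: {gap_with:.1f} points")
```

In this made-up case the average impact is 2 points, yet the gap between the two groups grows from 20 to 24 points, which is the pattern the paragraph above warns about.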

2013-03-17

The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for this. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual districts, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then some doubt is cast on the completeness of the confirmatory finding. We may not be certain of a moderator effect, for example, but once it comes to light, the average impact can also be considered incomplete or misleading for practical purposes. If it is necessary to conduct an additional experiment to verify a differential subgroup impact, the same experiment may verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizability of the results from even the most stringently designed experiment mean that the results should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work are the critical information to have. For a school system decision maker, the most useful pieces of information are insights into subpopulations that appear to benefit and conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies that have confirmatory results in contexts similar to or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment. The practitioner also has to come to a hypothesis about how things work as the basis for action.

2012-04-01

Study of Alabama STEM Initiative Finds Positive Impacts

On February 21, 2012, the U.S. Department of Education released the final report of an experiment that Empirical Education has been working on for the last six years. The report, titled Evaluation of the Effectiveness of the Alabama Math, Science, and Technology Initiative (AMSTI), is now available on the Institute of Education Sciences website. The Alabama State Department of Education held a press conference to announce the findings, attended by Superintendent of Education Bice and staff of AMSTI, along with educators, students, and the co-principal investigator of the study, Denis Newman, CEO of Empirical Education. The press release issued by the Alabama State Department of Education and a WebEx presentation provide more detail on the study’s findings.

AMSTI was developed by the state of Alabama and introduced in 2002 with the goal of improving mathematics and science achievement in the state’s K-12 schools. Empirical Education was primarily responsible for conducting the study, including the design, data collection, analysis, and reporting, under its subcontract with the Regional Education Lab, Southeast (the study was initiated through a research grant to Empirical). Researchers from the Academy for Educational Development, Abt Associates, and ANALYTICA made important contributions to design, analysis, and data collection.

The findings show that after one year, students in the 41 AMSTI schools experienced an impact on mathematics achievement equivalent to 28 days of additional student progress over students receiving conventional mathematics instruction. The study found, after one year, no difference for science achievement. It also found that AMSTI had an impact on teachers’ active learning classroom practices in math and science that, according to the theory of action posited by AMSTI, should have an impact on achievement. Further exploratory analysis found effects for student achievement in both mathematics and science after two years. The study also explored reading achievement, where it found significant differences between the AMSTI and control groups after one year. Exploration of differential effect for student demographic categories found consistent results for gender, socio-economic status, and pretest achievement level for math and science. For reading, however, the breakdown by student ethnicity suggests a differential benefit.

Just about everybody at Empirical worked on this project at one point or another. Besides the three of us (Newman, Jaciw and Zacamy) who are listed among the authors, we want to acknowledge past and current employees whose efforts made the project possible: Jessica Cabalo, Ruthie Chang, Zach Chin, Huan Cung, Dan Ho, Akiko Lipton, Boya Ma, Robin Means, Gloria Miller, Bob Smith, Laurel Sterling, Qingfeng Zhao, Xiaohui Zheng, and Margit Zsolnay.

With solid cooperation of the state’s Department of Education and the AMSTI team, approximately 780 teachers and 30,000 upper-elementary and middle school students in 82 schools from five regions in Alabama participated in the study. The schools were randomized into one of two categories: 1) those that received AMSTI starting the first year, or 2) those that received “business as usual” the first year and began participation in AMSTI the second year. With only a one-year delay before the control group entered treatment, the two-year impact was estimated using statistical techniques developed by, and with the assistance of, our colleagues at Abt Associates. The Academy for Educational Development assisted with data collection and analysis of training and program implementation.

Findings of the AMSTI study will also be presented at the Society for Research on Educational Effectiveness (SREE) Spring Conference taking place in Washington D.C. from March 8-10, 2012. Join Denis Newman, Andrew Jaciw, and Boya Ma on Friday March 9, 2012 from 3:00pm-4:30pm, when they will present findings of their study titled, “Locating Differential Effectiveness of a STEM Initiative through Exploration of Moderators.” A symposium on the study, including the major study collaborators, will be presented at the annual conference of the American Educational Research Association (AERA) on April 15, 2012 from 2:15pm-3:45pm at the Marriott Pinnacle ⁄ Pinnacle III in Vancouver, Canada. This session will be chaired by Ludy van Broekhuizen (director of REL-SE) and will include presentations by Steve Ricks (director of AMSTI); Jean Scott (SERVE Center at UNCG); Denis Newman, Andrew Jaciw, Boya Ma, and Jenna Zacamy (Empirical Education); Steve Bell (Abt Associates); and Laura Gould (formerly of AED). Sean Reardon (Stanford) will serve as the discussant. A synopsis of the study will also be included in the Common Guidelines for Education Research and Development.

2012-02-21

Exploration in the World of Experimental Evaluation

Our 300+ page report makes a good start. But IES, faced with limited time and resources to complete the many experiments being conducted within the Regional Education Lab system, put strict limits on the number of exploratory analyses researchers could conduct. We usually think of exploratory work as questions to follow up on puzzling or unanticipated results. However, in the case of the REL experiments, IES asked researchers to focus on a narrow set of “confirmatory” results and anything else was considered “exploratory,” even if the question was included in the original research design.

The strict IES criteria were based on the principle that when a researcher is using tests of statistical significance, the probability of erroneously concluding that there is an impact when there isn’t one increases with the frequency of the tests. In our evaluation of AMSTI, we limited ourselves to only four such “confirmatory” (i.e., not exploratory) tests of statistical significance. These were used to assess whether there was an effect on student outcomes for math problem-solving and for science, and the amount of time teachers spent on “active learning” practices in math and in science. (Technically, IES considered this two sets of two, since two were the primary student outcomes and two were the intermediate teacher outcomes.) The threshold for significance was made more stringent to keep the probability of falsely concluding that there was a difference for any of the outcomes at 5% (often expressed as p < .05).
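
The arithmetic behind this principle can be sketched in a few lines. This is a minimal illustration that assumes the tests are independent and uses a Bonferroni-style adjustment (dividing the threshold by the number of tests). The adjustment actually applied in the report is not described here, and the function names are hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests,
    each run at significance threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for n in (1, 2, 4):
    unadjusted = familywise_error_rate(0.05, n)
    per_test = bonferroni_threshold(0.05, n)
    adjusted = familywise_error_rate(per_test, n)
    print(f"{n} test(s): unadjusted rate {unadjusted:.4f}, "
          f"per-test threshold {per_test:.4f}, adjusted rate {adjusted:.4f}")
```

Under these assumptions, two tests each run at .05 carry about a 9.75% chance of at least one false positive. Tightening each test to .025 brings that chance back to just under 5%.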

While the logic for limiting the number of confirmatory outcomes is based on technical arguments about adjustments for multiple comparisons, the limit on the amount of exploratory work was based more on resource constraints. Researchers are notorious (and we don’t exempt ourselves) for finding more questions in any study than were originally asked. Curiosity-based exploration can indeed go on forever. In the case of our evaluation of AMSTI, however, there were a number of fundamental policy questions that were not answered either by the confirmatory or by the exploratory questions in our report. More research is needed.

Take the confirmatory finding that the program resulted in the equivalent of 28 days of additional math instruction (or technically an impact of 5% of a standard deviation). This is a testament to the hard work and ingenuity of the AMSTI team and the commitment of the school systems. From a state policy perspective, it gives a green light to continuing the initiative’s organic growth. But since all the schools in the experiment applied to join AMSTI, we don’t know what would happen if AMSTI were adopted as the state curriculum requiring schools with less interest to implement it. Our results do not generalize to that situation. Likewise, if another state with different levels of achievement or resources were to consider adopting it, we would say that our study gives good reason to try it but, to quote Lee Cronbach, a methodologist whose ideas increasingly resonate as we translate research into practice: “…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (Cronbach, 1975, p. 125).

The explorations we conducted as part of the AMSTI evaluation did not take the usual form of deeper examinations of interesting or unexpected findings uncovered during the planned evaluation. All the reported explorations were questions posed in the original study plan. They were defined as exploratory either because they were considered of secondary interest, such as the outcome for reading, or because they were not a direct causal result of the randomization, such as the results for subgroups of students defined by different demographic categories. Nevertheless, exploration of such differences is important for understanding how and for whom AMSTI works. The overall effect, averaging across subgroups, may mask differences that are of critical importance for policy.

Readers interested in the issue of subgroup differences can refer to Table 6.11. Once differences are found in groups defined in terms of individual student characteristics, our real exploration is just beginning. For example, can the difference be accounted for by other characteristics or combinations of characteristics? Is there something that differentiates the classes or schools that different students attend? Such questions begin to probe additional factors that can potentially be addressed in the program or its implementation. In any case, the report just released is not the “final report.” There is still a lot of work necessary to understand how any program of this sort can continue to be improved.

2012-02-14

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research on Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including over many locales, where the results are averaged) provides a useful starting point, but not “proof” that it will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, and how the working hypothesis corroborated through the experiment did not transfer to a different context. We also see applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale up the intervention, but where the grants included funding for research under new conditions where implementation may head in unanticipated directions, leading to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. The fact that an estimate of average impact is not robust across conditions should not be seen as a threat to generalizability. We should spend some time looking at the ways that the intervention interacts differently with local characteristics, in order to determine which factors account for heterogeneity in the impact and which ones do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge, as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. The call for quicker turn-around in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 116-127.

2011-03-21

Looking Back 35 Years to Learn about Local Experiments

With the growing interest among federal agencies in building local capacity for research, we took another look at an article by Lee Cronbach published in 1975. We found it has a lot to say about conducting local experiments and implications for generalizability. Cronbach worked for much of his career at Empirical’s neighbor, Stanford University, and his work has had a direct and indirect influence on our thinking. Some may interpret Cronbach’s work as stating that randomized trials of educational interventions have no value because of the complexity of interactions between subjects, contexts, and the experimental treatment. In any particular context, these interactions are infinitely complex, forming a “hall of mirrors” (as he famously put it, p. 119), making experimental results—which at most can address a small number of lower-order interactions—irrelevant. We don’t read it that way. Rather, we see powerful insights as well as cautions for conducting the kinds of field experiments that are beginning to show promise for providing educators with useful evidence.

We presented these ideas at the Society for Research in Educational Effectiveness conference in March, building the presentation around a set of memorable quotes from the 1975 article. Here we highlight some of the main ideas.

Quote #1: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125).

Practitioners are making decisions for their local jurisdiction. An experiment conducted elsewhere (including over many locales, where the results are averaged) provides a useful starting point, but not “proof” that it will or will not work in the same way locally. Experiments give us a working hypothesis concerning an effect, but it has to be tested against local conditions at the appropriate scale of implementation. This brings to mind California’s experience with class size reduction following the famous experiment in Tennessee, and how the working hypothesis corroborated through the experiment did not transfer to a different context. We also see applicability of Cronbach’s ideas in the Investing in Innovation (i3) program, where initial evidence is being taken as a warrant to scale-up intervention, but where the grants included funding for research under new conditions where implementation may head in unanticipated directions, leading to new effects.

Quote #2: “Instead of making generalization the ruling consideration in our research, I suggest that we reverse our priorities. An observer collecting data in one particular situation…will give attention to whatever variables were controlled, but he will give equally careful attention to uncontrolled conditions…. As results accumulate, a person who seeks understanding will do his best to trace how the uncontrolled factors could have caused local departures from the modal effect. That is, generalization comes late, and the exception is taken as seriously as the rule” (pp. 124-125).

Finding or even seeking out conditions that lead to variation in the treatment effect facilitates external validity, as we build an account of the variation. The fact that an estimate of average impact is not robust across conditions should not, in itself, be seen as a threat to generalizability. We should spend some time looking at the ways that the intervention interacts differently with local characteristics, in order to determine which factors account for heterogeneity in the impact and which ones do not. Though this activity is exploratory and not necessarily anticipated in the design, it provides the basis for understanding how the treatment plays out, and why its effect may not be constant across settings. Over time, generalizations can emerge, as we compile an account of the different ways in which the treatment is realized and the conditions that suppress or accentuate its effects.

Quote #3: “Generalizations decay” (p. 122).

In the social policy arena, and especially with the rapid development of technologies, we can’t expect interventions to stay constant. And we certainly can’t expect the contexts of implementation to be the same over many years. The call for quicker turn-around in our studies is therefore necessary, not just because decision-makers need to act, but because any finding may have a short shelf life.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 116-127.

2011-03-21

Webinar: Uncovering ARRA’s Research Requirements

Researchers at Empirical Education gave state department of education officials a detailed overview of the research themes and requirements of the ARRA stimulus initiatives during their December 9 webinar, entitled “Meet Stimulus Funds’ Research Requirements with Confidence.” The webinar gave specific examples of how states may start planning their applications and building research partnerships, as well as an overview of the ED’s current thinking about building local research capacity. The initiatives that were discussed included Race to the Top, Enhancing Education Through Technology, Investing in Innovation, Title I School Improvement Grants, and State Longitudinal Data Systems.

A follow-up webinar was broadcast on January 20, 2010; it outlined a specific example of a program evaluation design that districts can use with existing data. The presentation can be viewed below. Stay tuned for future webinars on other alternative experimental research designs.

2010-01-22

Presentation at the Society for Research on Educational Effectiveness (SREE) Explores Methods for Studying Achievement Gaps

In Empirical Education’s experimental evaluations for school districts, the question of local concern is frequently an achievement gap identified between two student groups. The analysis of these experiments also often finds significant differences between these subgroups in how effective the intervention was (that is, whether it increased or decreased the gap) while not finding a significant overall difference. In his 2005 book, Howard Bloom suggested why there may be more statistical power to detect subgroup differences than to detect the average effect. The exploration presented at SREE, which was held in Washington March 1-3, examined the statistical characteristics of eight experiments conducted over the last three years to find out whether a critical assumption of Bloom’s approach held. His assumption is that the average performance gap does not vary across the units that are randomized. The work, led by Andrew P. Jaciw, Empirical Education’s Director of Experimental Design and Analysis, found that the assumption held. This finding is important because it suggests that local experiments focusing on achievement gaps may be less expensive than experiments addressing only the overall average effect of an intervention. (Click here for a copy of the poster and handout.)
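To make the power argument concrete, here is a minimal sketch using textbook variance formulas for a two-arm design in which whole clusters (for example, classrooms) are randomized and each cluster contains equal numbers of students from the two subgroups. It is an illustration under stated assumptions, not the analysis presented at SREE: the function names, the balanced design, and all numeric values are hypothetical.

```python
import math


def se_average_effect(n_clusters, n_per_cluster, tau2, sigma2):
    """Standard error of the average effect when whole clusters are randomized.

    tau2   : between-cluster variance of the outcome (assumed value)
    sigma2 : within-cluster (student-level) variance (assumed value)
    """
    # A cluster mean carries the cluster-level variance plus sampling error.
    var_cluster_mean = tau2 + sigma2 / n_per_cluster
    # Difference between two arms, each with n_clusters / 2 clusters.
    return math.sqrt(4 * var_cluster_mean / n_clusters)


def se_differential_effect(n_clusters, n_per_cluster, sigma2, tau2_gap=0.0):
    """Standard error of the treatment-by-subgroup (gap) effect.

    tau2_gap : variance of the subgroup gap across clusters; zero corresponds
               to the assumption that the gap does not vary across units.
    """
    # The within-cluster gap (mean of subgroup A minus mean of subgroup B)
    # cancels the cluster-level intercept; each subgroup mean is based on
    # n_per_cluster / 2 students.
    var_cluster_gap = tau2_gap + 4 * sigma2 / n_per_cluster
    return math.sqrt(4 * var_cluster_gap / n_clusters)


# Hypothetical design: 40 clusters of 20 students, intraclass correlation 0.2.
J, n, tau2, sigma2 = 40, 20, 0.2, 0.8
print("SE of average effect:     ", round(se_average_effect(J, n, tau2, sigma2), 4))
print("SE of differential effect:", round(se_differential_effect(J, n, sigma2), 4))
```

Under these assumed values the differential effect has the smaller standard error, because the between-cluster variance drops out of the within-cluster comparison. If the gap does vary across clusters (`tau2_gap` greater than zero), that advantage shrinks and can disappear, which is why the assumption matters.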

Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning More From Social Experiments. New York, NY: Sage.

2009-03-01

Methods for Local Experimental Evaluation of STEM Initiatives Presented to State Legislators

The National Conference of State Legislatures (NCSL) presented a seminar for education committee chairs January 9-11 in Huntsville, Alabama, home of NASA’s Marshall Space Flight Center. The topic was “Linking Research and Policy to Improve Science, Technology, Engineering and Math Education.” Empirical Education’s president, Denis Newman, presented the company’s research on the Alabama Math Science and Technology Initiative, an 80-school randomized experiment being conducted as part of its contract with the Regional Education Laboratory for the Southeast. The presentation also drew on findings from experiments the company has conducted to evaluate STEM initiatives elsewhere in the country to illustrate the importance of local research goals and characteristics in evaluation design. The seminar was part of a series on research funded by the Institute of Education Sciences. (Click here for a copy of the presentation.)

2009-02-01