blog posts and news stories

How Efficacy Studies Can Help Decision-makers Decide if a Product is Likely to Work in Their Schools

We and our colleagues have been working on translating the results of rigorous studies of the impact of educational products, programs, and policies for people in school districts who are deciding whether to purchase, or even just try out (pilot), a product. We are influenced by Stanford University methodologist Lee Cronbach, especially his seminal book (1982) and article (1975), in which he concludes, “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125). In other words, we consider even the best-designed experiment to be like a case study: when interpreting the causal effect of the program, it is as much about the local and moderating role of context as about the treatment.

Following the focus on context, we can consider characteristics of the people and of the institution where the experiment was conducted to be co-causes of the result that deserve full attention, even though, technically, only the treatment, which was randomly assigned, was controlled. Here we argue that any generalization from a rigorous study, where the question is whether the product is likely to be worth trying in a new district, must consider the full context of the study.

Technically, in the language of evaluation research, these differences in who or where the product or “treatment” works are called “interaction effects” between the treatment and the characteristic of interest (e.g., subgroups of students by demographic category or achievement level, teachers with different skills, or bandwidth available in the building). The characteristic of interest can be called a “moderator”, since it changes, or moderates, the impact of the treatment. An interaction reveals if there is differential impact and whether a group with a particular characteristic is advantaged, disadvantaged, or unaffected by the product.
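To make the idea of an interaction effect concrete, here is a minimal sketch in Python. It is not taken from any of the studies discussed; the function name `fit_interaction`, the simulated data, and all effect sizes are invented for illustration. It fits a regression of the outcome on treatment, a binary moderator, and their product, so that the coefficient on the product term estimates how much the impact differs between the two groups.

```python
import numpy as np

def fit_interaction(treatment, moderator, outcome):
    """Ordinary least squares fit of
    outcome ~ intercept + treatment + moderator + treatment * moderator.
    The 'interaction' coefficient is the difference in impact between groups."""
    X = np.column_stack([
        np.ones_like(treatment, dtype=float),
        treatment,
        moderator,
        treatment * moderator,
    ])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    names = ["intercept", "treatment", "moderator", "interaction"]
    return dict(zip(names, coef))

# Simulated data; every number here is hypothetical.
rng = np.random.default_rng(0)
n = 4000
treatment = rng.integers(0, 2, size=n).astype(float)  # random assignment
moderator = rng.integers(0, 2, size=n).astype(float)  # e.g., 1 = English learner
true_impact_reference = 0.30  # impact when moderator == 0
true_interaction = -0.20      # change in impact when moderator == 1
outcome = (
    true_impact_reference * treatment
    + 0.10 * moderator
    + true_interaction * treatment * moderator
    + rng.normal(0.0, 1.0, size=n)
)

est = fit_interaction(treatment, moderator, outcome)
impact_reference = est["treatment"]                       # impact for moderator == 0
impact_moderated = est["treatment"] + est["interaction"]  # impact for moderator == 1
print(f"impact (reference group): {impact_reference:.2f}")
print(f"impact (moderated group): {impact_moderated:.2f}")
print(f"differential impact:      {est['interaction']:.2f}")
```

A real analysis of a cluster randomized trial would also need to account for the nesting of students within schools; this sketch ignores that for brevity.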

The rules set out by the Department of Education’s What Works Clearinghouse (WWC) focus on the validity of the experimental conclusion: Did the program work on average compared to a control group? Whether it works better for poor kids than for middle class kids, works better for uncertified teachers than for veteran teachers, or increases or closes a gap between English learners and those who are proficient is not part of the information provided in its reviews. But these differences are exactly what buyers need in order to understand whether the product is a good candidate for a population like theirs. If a program works substantially better for English proficient students than for English learners, and the purchasing school has largely the latter type of student, it is important that the school administrator know the context for the research and the result.

The accuracy of an experimental finding depends on its not being moderated by conditions. This is recognized in recent methods of generalization (Tipton, 2013) that essentially apply non-experimental adjustments to experimental results to make them more accurate and more relevant to specific local contexts.

Work by Jaciw (2016a, 2016b) takes this one step further.

First, he confirms the result that if the impact of the program is moderated, and if moderators are distributed differently between sites, then an experimental result from one site will yield a biased inference for another site. This would be the case, for example, if the impact of a program depends on individual socioeconomic status, and there is a difference between the study and inference sites in the proportion of individuals with low socioeconomic status. Conditions for this “external validity bias” are well understood, but the consequences are addressed much less often than the usual selection bias. Experiments can yield accurate results about the efficacy of a program for the sample studied, but that average may not apply either to a subgroup within the sample or to a population outside the study.

Second, he uses results from a multisite trial to show empirically that there is potential for significant bias when inferring experimental results from one subset of sites to other inference sites within the study; however, moderators can account for much of the variation in impact across sites. Average impact findings from experiments provide a summary of whether a program works, but leave the consumer guessing about the boundary conditions for that effect, that is, the limits beyond which the average effect ceases to apply. Cronbach was highly aware of this, titling a chapter in his 1982 book “The Limited Reach of Internal Validity”. Using terms like “unbiased” to describe impact findings from experiments is correct in a technical sense (i.e., the point estimate, on hypothetical repeated sampling, is centered on the true average effect for the sample studied), but it can impart an incorrect sense of the external validity of the result: that it applies beyond the instance of the study.
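The first point above, that differing moderator distributions across sites produce a biased inference, can be illustrated with simple arithmetic. The sketch below is only an illustration: the function `site_average_impact`, the subgroup impacts, and the population shares are all hypothetical, and it assumes that each subgroup's impact is the same at both sites, so the moderator accounts for all of the difference between them.

```python
def site_average_impact(subgroup_impacts, subgroup_shares):
    """Average impact at a site: subgroup impacts weighted by the
    share of the site's population in each subgroup."""
    if abs(sum(subgroup_shares.values()) - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(subgroup_impacts[g] * subgroup_shares[g] for g in subgroup_impacts)

# Hypothetical impacts (in standard deviation units) by socioeconomic status.
impacts = {"low_ses": 0.10, "higher_ses": 0.30}

# Hypothetical population shares at the study site and at the inference site.
study_shares = {"low_ses": 0.25, "higher_ses": 0.75}
target_shares = {"low_ses": 0.75, "higher_ses": 0.25}

study_average = site_average_impact(impacts, study_shares)    # 0.10*0.25 + 0.30*0.75
target_average = site_average_impact(impacts, target_shares)  # 0.10*0.75 + 0.30*0.25

# Using the study's average as a prediction for the inference site
# overstates the impact by this amount.
external_validity_bias = study_average - target_average

print(f"study-site average impact:     {study_average:.2f}")
print(f"inference-site average impact: {target_average:.2f}")
print(f"external validity bias:        {external_validity_bias:.2f}")
```

With these made-up numbers, the experiment's average is accurate for the study site, yet it would mislead a district whose students are mostly in the subgroup that benefits less.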

The work cited has three implications. First, it is possible to unpack marginal impact estimates through subgroup and moderator analyses to arrive at more accurate inferences for individuals. Second, we should do so: why obscure differences by paying attention to only the grand mean impact estimate for the sample? Third, we should be deliberate in deciding which subgroups to assess impacts for in the context of individual experiments.

Local decision-makers’ primary concern should be whether a program will work with their specific population, and they should ask for causal evidence that considers local conditions through the moderating role of student, teacher, and school attributes. Looking at finer differences in impact may elicit criticism that it introduces another type of uncertainty, specifically from random sampling error, which may be minimal with gross impacts and large samples, but influential when looking at differences in impact with more and smaller samples. This is a fair criticism, but differential effects may be less susceptible to random perturbations (low power) than assumed, especially if subgroups are identified at individual levels in the context of cluster randomized trials (e.g., individual student-level SES, as opposed to school average SES) (Bloom, 2005; Jaciw, Lin, & Ma, 2016).

References:
Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning more from social experiments. New York: Russell Sage Foundation.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 116-127.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.

Jaciw, A. P. (2016a). Applications of a within-study comparison approach for evaluating bias in generalized causal inferences from comparison group studies. Evaluation Review, 40(3), 241-276. Retrieved from http://erx.sagepub.com/content/40/3/241.abstract

Jaciw, A. P. (2016b). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40(3), 199-240. Retrieved from http://erx.sagepub.com/content/40/3/199.abstract

Jaciw, A., Lin, L., & Ma, B. (2016). An empirical study of design parameters for assessing differential impacts for students in group randomized trials. Evaluation Review. Retrieved from http://erx.sagepub.com/content/early/2016/10/14/0193841X16659600.abstract

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239-266.

2018-01-16

Presenting at AERA 2017

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in San Antonio, TX from April 27 – 30, 2017.

Research Presentations will include the following.

Increasing Accessibility of Professional Development (PD): Evaluation of an Online PD for High School Science Teachers
Authors: Adam Schellinger, Andrew P Jaciw, Jenna Lynn Zacamy, Megan Toby, & Li Lin
In Event: Promoting and Measuring STEM Learning
Saturday, April 29 10:35am to 12:05pm
Henry B. Gonzalez Convention Center, River Level, Room 7C

Abstract: This study examines the impact of an online teacher professional development, focused on academic literacy in high school science classes. A one-year randomized control trial measured the impact of Internet-Based Reading Apprenticeship Improving Science Education (iRAISE) on instructional practices and student literacy achievement in 27 schools in Michigan and Pennsylvania. Researchers found a differential impact of iRAISE favoring students with lower incoming achievement (although there was no overall impact of iRAISE on student achievement). Additionally, there were positive impacts on several instructional practices. These findings are consistent with the specific goals of iRAISE: to provide high-quality, accessible online training that improves science teaching. Authors compare these results to previous evaluations of the same intervention delivered through a face-to-face format.


How Teacher Practices Illuminate Differences in Program Impact in Biology and Humanities Classrooms
Authors: Denis Newman, Val Lazarev, Andrew P Jaciw, & Li Lin
In Event: Poster Session 5 - Program Evaluation With a Purpose: Creating Equal Opportunities for Learning in Schools
Friday, April 28 12:25 to 1:55pm
Henry B. Gonzalez Convention Center, Street Level, Stars at Night Ballroom 4

Abstract: This paper reports research to explain the positive impact in a major RCT for students in the classrooms of a subgroup of teachers. Our goal was to understand why there was an impact for science teachers but not for teachers of humanities, i.e., history and English. We have labelled our analysis “moderated mediation” because we start with the finding that the program’s success was moderated by the subject taught by the teacher and then go on to look at the differences in mediation processes depending on the subject being taught. We find that program impacts on teacher practices differ by mediator (as measured in surveys and observations) and that mediators are differentially associated with student impact based on context.


Are Large-Scale Randomized Controlled Trials Useful for Understanding the Process of Scaling Up?
Authors: Denis Newman, Val Lazarev, Jenna Lynn Zacamy, & Li Lin
In Event: Poster Session 3 - Applied Research in School: Education Policy and School Context
Thursday, April 27 4:05 to 5:35pm
Henry B. Gonzalez Convention Center, Ballroom Level, Hemisfair Ballroom 2

Abstract: This paper reports a large-scale program evaluation that included an RCT and a parallel study of 167 schools outside the RCT, which provided an opportunity to study the growth of a program and to compare the two contexts. Teachers in both contexts were surveyed, and a large subset of the questions was asked of both scale-up teachers and teachers in the treatment schools of the RCT. We find large differences in the level of commitment to program success in the school. Far less commitment was found in the RCT, suggesting that a large-scale RCT may not be capturing the processes at play in the scale-up of a program.

We look forward to seeing you at our sessions to discuss our research. You can also view our presentation schedule here.

2017-04-17

IES Releases New Empirical Education Report on Educator Effectiveness

Our report just released by IES examines the statistical properties of Arizona’s new multiple-measure teacher evaluation model. The study used data from the pilot in 2012-13 to explore the relationships among the system’s component measures (teacher observations, student academic progress, and stakeholder surveys). It also investigated how well the model differentiated between higher and lower performing teachers. Findings suggest that the model’s observation measure may be improved through further evaluator training and calibration, and that a single aggregated composite score may not adequately represent independent aspects of teacher performance.

The study was carried out in partnership with the Arizona Department of Education as part of our work with the Regional Education Laboratory (REL) West’s Educator Effectiveness Alliance, which includes Arizona, Utah, and Nevada Department of Education officials, as well as teacher union representatives, district administrators, and policymakers. While the analysis is specific to Arizona’s model, the study findings and methodology may be of interest to other state education agencies that are developing or implementing new multiple-measure evaluation systems. We have continued this work with additional analyses for alliance members and plan to provide additional capacity building during 2015.

2014-12-16

Empirical Presents at AERA 2012

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in Vancouver, Canada from April 13 – 17, 2012. Our presentations will span two divisions: 1) Measurement and Research Methodology and 2) Research, Evaluation and Assessment in Schools.

Research Topics will include:

Current Studies in Program Evaluation to Improve Student Achievement Outcomes

Evaluating Alabama’s Math, Science and Technology Initiative: Results of a Three-Year, State-Wide Randomized Experiment

Accommodating Data From Quasi-Experimental Design

Quantitative Approaches to the Evaluation of Literacy Programs and Instruction for Elementary and Secondary Students

We look forward to seeing you at our sessions to discuss our research. You can also download our presentation schedule here. As has become tradition, we plan to host yet another of our popular AERA receptions. Details about the reception will follow in the months to come.

2011-11-18

Recognizing Success

When the Obama-Duncan administration approaches teacher evaluation, the emphasis is on recognizing success. We heard that clearly in Arne Duncan’s comments on the release of teacher value-added modeling (VAM) data for LA Unified by the LA Times. He’s quoted as saying, “What’s there to hide? In education, we’ve been scared to talk about success.” Since VAM is often thought of as a method for weeding out low performing teachers, Duncan’s statement referencing success casts the use of VAM in a more positive light. Therefore we want to raise the issue here: how do you know when you’ve found success? The general belief is that you’ll recognize it when you see it. But sorting through a multitude of variables is not a straightforward process, and that’s where research methods and statistical techniques can be useful. Below we illustrate how this plays out in teacher and in program evaluation.

As we report in our news story, Empirical is participating in the Gates Foundation project called Measures of Effective Teaching (MET). This project is known for its focus on value-added modeling (VAM) of teacher effectiveness. It is also known for having collected over 10,000 videos from over 2,500 teachers’ classrooms—an astounding accomplishment. Research partners from many top institutions hope to be able to identify the observable correlates for teachers whose students perform at high levels as well as for teachers whose students do not. (The MET project tested all the students with an “alternative assessment” in addition to using the conventional state achievement tests.) With this massive sample that includes both data about the students and videos of teachers, researchers can identify classroom practices that are consistently associated with student success. Empirical’s role in MET is to build a web-based tool that enables school system decision-makers to make use of the data to improve their own teacher evaluation processes. Thus they will be able to build on what’s been learned when conducting their own mini-studies aimed at improving their local observational evaluation methods.

When the MET project recently had its “leads” meeting in Washington DC, the assembled group of researchers, developers, school administrators, and union leaders was treated to an after-dinner speech and Q&A by Joanne Weiss. Joanne is now Arne Duncan’s chief of staff, after having directed the Race to the Top program (and before that having been involved in many Silicon Valley educational innovations). The current administration’s approach to teacher evaluation, with its emphasis on recognizing success, carries over into program evaluation. This attitude was clear in Joanne’s presentation, in which she declared an intention to “shine a light on what is working.” The approach is part of their thinking about the reauthorization of ESEA, where more flexibility is given to local decision-makers to develop solutions, while the federal legislation is more about establishing achievement goals such as being the leader in college graduation.

Hand in hand with providing flexibility to find solutions, Joanne also spoke of the need to build “local capacity to identify and scale up effective programs.” We welcome the idea that school districts will be free to try out good ideas and identify those that work. This kind of cycle of continuous improvement is very different from the idea, incorporated in NCLB, that researchers will determine what works and disseminate these facts to the practitioners. Joanne spoke about continuous improvement, in the context of teachers and principals, where on a small scale it may be possible to recognize successful teachers and programs without research methodologies. While a teacher’s perception of student progress in the classroom may be aided by regular assessments, the determination of success seldom calls for research design. We advocate for a broader scope, and maintain that a cycle of continuous improvement is just as much needed at the district and state levels. At those levels, we are talking about identifying successful schools or successful programs where research and statistical techniques are needed to direct the light onto what is working. Building research capacity at the district and state level will be a necessary accompaniment to any plan to highlight successes. And, of course, research can’t be motivated purely by the desire to document the success of a program. We have to be equally willing to recognize failure. The administration will have to take seriously the local capacity building to achieve the hoped-for identification and scaling up of successful programs.

2010-11-18

Making Vendor Research More Credible

The latest evidence that research can be both rigorous and relevant was the subject of an announcement that the Software and Information Industry Association (SIIA) made last month about their new guidelines for conducting effectiveness research. The document is aimed at SIIA members, most of whom are executives of education software and technology companies and not necessarily schooled in research methodology. The main goal in publishing the guidelines is to improve the quality—and therefore the credibility—of research sponsored by the industry. The document provides SIIA members with things to keep in mind when contracting for research or using research in marketing materials. The document also has value for educators, especially those responsible for purchasing decisions. That’s an important point that I’ll get back to.

One thing to make clear in this blog entry is that while your humble blogger (DN) is given credit as the author, the Guidelines actually came from a working group of SIIA members who put in many months of brainstorming, discussion, and review. DN’s primary contribution was just to organize the ideas, ensure they were technically accurate, and put them into easy-to-understand language.

Here’s a taste of some of the ideas contained in the 22 guidelines:

  • With a few exceptions, all research should be reported regardless of the result. Cherry picking just the studies with strong positive results distorts the facts and in the long run hurts credibility. One lesson that might be taken from this is that conducting several small studies may be preferable to trying to prove a product effective (or not) in a single study.

  • Always provide a link to the full report. Too often in marketing materials (including those of advocacy groups, not just publishers) a fact such as “8th grade math achievement increased from 31% in 2004 to 63% in 2005,” is offered with no citation. In this specific case, the fact was widely cited but after considerable digging could be traced back to a report described by the project director as “anecdotal”.

  • Be sure to take implementation into account. In education, all instructional programs require setting up complex systems of teacher-student interaction, which can vary in numerous ways. Issues of how research can support the process and what to do with inadequate or outright failed implementation must be understood by researchers and consumers of research.

  • Watch out for the control condition. In education there are no placebos. In almost all cases we are comparing a new program to whatever is in place. Depending on how well the existing program works, the program being evaluated may appear to have an impact or not. This calls for careful consideration of where to test a product and understandable concern by educators as to how well a particular product tested in another district will perform against what is already in place in their district.

The Guidelines are not just aimed at industry. SIIA believes that as decision-makers at schools begin to see a commitment to providing stronger research, their trust in the results will increase. It is also in the educators’ interest to review the guidelines because they provide a reference point for what actionable research should look like. Ultimately, the Guidelines provide educators with help in conducting their own research, whether it is on their own or in partnership with the education technology providers.

2010-06-01

Report Released on Phase Two of The Efficacy of PCI’s Reading Program

The results are in for Phase Two of a five-year longitudinal efficacy trial of PCI’s Reading Program for students with moderate to severe disabilities. This research builds upon an initial randomized control trial conducted last year that found that students in the PCI program had substantial success in learning sight words in comparison to students in the control group. Phase Two continues research in the Brevard and Miami-Dade County school districts with teachers of supported-level students in grades 3-8. Using both quasi-experimental and extra-experimental methods, findings again demonstrate that students who received PCI for two years achieved significantly higher scores on the sight word assessment than students who were not exposed to the program. However, student progress through the program was slower than initially expected by the developers. Empirical will continue to collect, integrate, and analyze outcomes for three more years.

The methodological designs for this study were presented at this year’s annual SREE conference in Washington, D.C. Results for this study will also be presented at the 2010 Annual AERA Meeting in Denver, CO. Meet the research team as they describe the study in further detail during the Division C poster session on May 3.

2010-04-14

Conference Season has Arrived

Springtime marks the start of “conference season” and Empirical Education has been busy attending and preparing for the various meetings and events. We are participating in five conferences (CoSN, SIIA, SREE, NCES-MIS, and AERA) and we hope to see some familiar faces in our travels. If you will be attending any of the following meetings, please give us a call. We’d love to schedule a time to speak with you.

CoSN

The Empirical team headed to the 2010 Consortium for School Networking conference in Washington, DC at the Omni Shoreham Hotel from February 28 – March 3, 2010. We were joined by Eric Lehew, Executive Director of Learning Support Services at Poway Unified School District, who co-presented with us a poster titled “Turning Existing Data into Research” (Monday, March 1 from 1:00pm to 2:00pm). As an exhibitor, Empirical Education also hosted a 15-minute vendor demonstration entitled “Building Local Capacity: Using Your Own Data Systems to Easily Measure Program Effectiveness” to launch our MeasureResults tool.

SIIA

The Software & Information Industry Association held their 2010 Ed Tech Government Forum in Washington, DC on March 3–4. The focus this year was on Education Funding & Programs in a (Post) Stimulus World and included speakers such as Secretary of Education Arne Duncan and West Virginia Superintendent of Schools Steven Paine.

SREE

Just as the SIIA Forum came to a close, the Society for Research on Educational Effectiveness held their annual conference—Research Into Practice—March 4-6 where our chief scientist, Andrew Jaciw, and research scientist, Xiaohui Zheng, presented their poster on estimating long-term program impacts when the control group joins treatment in the short-term. Dr. Jaciw was also named on a paper presentation with Rob Olsen of Abt Associates.

Thursday March 4, 2010
3:30pm–5:00pm: Session 2
2E. Research Methodology
Examining State Assessments
Forum
Chair: Jane Hannaway, The Urban Institute
Using State Or Study-Administered Achievement Tests in Impact Evaluations
Rob Olsen and Fatih Unlu, Abt Associates and Andrew Jaciw, Empirical Education
Friday March 5, 2010
5:00pm–7:00pm: Poster Session
Poster Session: Research Methodology
Estimating Long-Term Program Impacts When the Control Group Joins Treatment in the Short-Term: A Theoretical and Empirical Study of the Tradeoffs Between Extra- and Quasi-Experimental Estimates
Andrew Jaciw, Boya Ma, and Qingfeng Zhao, Empirical Education

NCES-MIS

The 23rd Annual Management Information Systems (MIS) Conference was held in Phoenix, Arizona March 3-5. Co-sponsored by the Arizona Department of Education and the U.S. Department of Education’s National Center for Education Statistics (NCES), the MIS Conference brings together the people who work with information collection, management, transmittal, and reporting in school districts and state education agencies. The majority of the sessions focused on data use, data standards, statewide data systems, and data quality. For more information, refer to the program highlights.

AERA

We will have a strong showing at the American Educational Research Association annual conference in Denver, Colorado from Friday, April 30 through Tuesday, May 4. Please come talk to us at our poster and paper sessions. View our AERA presentation schedule to find out which of our presentations you would like to attend. And we hope to see you at our customary stylish reception Sunday evening, May 2 from 6 to 8:30—mark your calendars!

IES

We will be presenting at the IES Research Conference in National Harbor, MD from June 28-30. View our poster here.

2010-03-12

Empirical Education Appoints Chief Scientist

We are pleased to announce the appointment of Andrew Jaciw, Ph.D., as Empirical Education’s Chief Scientist. Since joining the company more than five years ago, Dr. Jaciw has guided and shaped our analytical and research design practices, infusing our experimental methodologies with the intellectual traditions of both Cronbach and Campbell. As Chief Scientist, he will continue to lead Empirical’s team of scientists, setting direction for our MeasureResults evaluation and analysis processes, as well as basic research into widely applicable methodologies. Andrew received his Ph.D. in Education from Stanford University.

2010-02-05

Four Presentations Accepted for AERA 2010

Empirical Education will be heading to the snow-capped mountains of Denver, Colorado next April! Once again, Empirical will have a strong showing at the 2010 American Educational Research Association conference, which will be held at the Colorado Convention Center on April 30 – May 4, 2010. Our presentations will span several divisions, including Learning & Instruction; Measurement & Research Methodology; and Research, Evaluation, & Assessment in Schools. Research topics will include:

  • Examining the Efficacy of a Sight-Word Reading Program for Students with Significant Cognitive Disabilities: Phase 2

  • Matched Pairs, ICCs and R-squared: Lessons from Several Effectiveness Trials in Education

  • Addressing Challenges of Within School Randomization

  • Measuring the Impact of a Math Program As It Is Rolled Out Over Several Years

In line with our past two years of successful AERA receptions, Empirical Education plans to host another “meet and greet” at this year’s conference. Join our mailing list to receive the details.

2009-12-01