Blog Posts and News Stories

View from the West Coast: Relevance is More Important than Methodological Purity

Bob Slavin published a blog post in which he argues that evaluation research can be damaged by using the cloud-based data routinely collected by today’s education technology (edtech). We see serious flaws in this argument, and it directly opposes the position we have taken in a number of papers and postings, and discussed as part of the west coast conversations about education research policy. Namely, we have argued that using the usage data routinely collected by edtech can greatly improve the relevance and usefulness of evaluations.

Bob’s argument is that if you use data collected during the implementation of the program to identify students and teachers who used the product as intended, you introduce bias. The case he is concerned with is a matched comparison study (or quasi-experiment), in which the researcher has to find students or classes that match those using the edtech. The key point he makes is:

“students who used the computers [or edtech product being evaluated] were more motivated or skilled than other students in ways the pretests do not detect.”

That is, there is an unmeasured characteristic, let’s call it motivation, that explains both the student’s desire to use the product and why they did better on the outcome measure. Since the characteristic is not measured, you don’t know which students in the control classes have this motivation. If you select the matching students only on the basis of their having the same pretest level, demographics, and other measured characteristics, but you don’t match on “motivation,” you have biased the result.
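
To make the hypothesized mechanism concrete, here is a minimal simulation sketch in Python. The numbers and variable names are purely illustrative and not drawn from any study: an unobserved trait drives both usage and outcomes, and a comparison that adjusts only for the pretest attributes part of that trait’s effect to the product.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unmeasured trait ("motivation") that drives both product usage
# and learning gains; all coefficients here are illustrative.
motivation = rng.normal(size=n)
pretest = rng.normal(size=n)                                # measured covariate
used_product = 0.8 * motivation + rng.normal(size=n) > 0    # usage depends on motivation
true_effect = 0.0                                           # the simulated product does nothing
posttest = (pretest + 0.5 * motivation
            + true_effect * used_product + rng.normal(size=n))

# "Matched" comparison adjusting only for the pretest (motivation is unobserved):
X = np.column_stack([np.ones(n), used_product.astype(float), pretest])
coef, *_ = np.linalg.lstsq(X, posttest, rcond=None)
print(f"estimated usage effect: {coef[1]:.2f}")  # clearly positive, though the true effect is 0
```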

The first thing to note about this concern is that there may not be a factor such as motivation that explains both edtech usage and the favorable outcome. There is only a theoretical possibility that such a variable is driving the result. The bias may or may not be there, and to reject a method because of an unverifiable possibility of bias is an extreme move.

Second, it is interesting that he uses an example that seems concrete but is not at all specific to the bias mechanism he’s worried about.

“Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious.”

This isn’t a problem of an unmeasured variable at all. The problem is that the usage didn’t cause the improvement; rather, the improvement caused the usage. This would be a problem even in a randomized “gold standard” experiment. The example makes it sound like the problem is “obvious” and concrete, when Bob’s concern is purely theoretical. The example is a good argument for having the kind of implementation analyses that ISTE is doing with its Edtech Advisor and that the Jefferson Education Exchange has embarked on.

What is most disturbing about Bob’s blog post is that he makes a statement that is not supported by the ESSA definitions or U.S. Department of Education regulations or guidance. He claims that:

“In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT [Intent To Treat, i.e., using all students in the pre-identified schools or classes where administrators intended to use the product] estimates, with the exception of randomizing at the start.”

It is true that Bob’s own site, Evidence for ESSA, will not accept any study that does not follow the ITT protocol, but ESSA itself does not require that constraint.

Essentially, Bob is throwing away relevance to school decision-makers in order to maintain an unnecessary purity of research design. School decision-makers care whether the product is likely to work with their school’s population and available resources. Can it solve their problem (e.g., reduce achievement gaps among demographic categories) if they can implement it adequately? Disallowing efficacy studies that consider compliance with a pre-specified level of usage in selecting the “treatment group” is to throw out relevance in favor of methodological purity. Yes, there is a potential for bias, which is why ESSA considers matched-comparison efficacy studies to be “moderate” evidence. But school decisions aren’t made on the basis of which product has the largest average effect when all the non-users are included. A measure of subgroup differences, when the implementation is adequate, provides more useful information.

2018-12-27

For Quasi-experiments on the Efficacy of Edtech Products, it is a Good Idea to Use Usage Data to Identify Who the Users Are

With edtech products, the usage data allows for precise measures of exposure and whether critical elements of the product were implemented. Providers often specify an amount of exposure or the kind of usage that is required to make a difference. Furthermore, educators often want to know whether the program has an effect when implemented as intended. Researchers can readily use data generated by the product (usage metrics) to identify compliant users, or to measure the kind and amount of implementation.

Since researchers generally track product implementation and statistical methods allow for adjustments for implementation differences, it is possible to estimate the impact on successful implementers, or technically, on a subset of study participants who were compliant with treatment. It is, however, very important that the criteria researchers use in setting a threshold be grounded in a model of how the program works. This will, for example, point to critical components that can be referred to in specifying compliance. Without a clear rationale for the threshold set in advance, the researcher may appear to be “fishing” for the amount of usage that produces an effect.
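
As a sketch of what identifying compliant users from product logs might look like, consider the following. The log fields, values, and the 60-minute / 3-lesson threshold are hypothetical stand-ins for whatever the provider’s logic model actually specifies.

```python
import pandas as pd

# Illustrative usage log exported from an edtech product; the field names and
# rows are hypothetical.
usage = pd.DataFrame({
    "student_id":        [101, 101, 102, 103, 103, 103],
    "session_minutes":   [35, 30, 10, 40, 35, 20],
    "lessons_completed": [2, 1, 0, 3, 2, 1],
})

# Compliance thresholds specified in advance from the provider's logic model,
# e.g. at least 60 minutes of use and 3 completed lessons over the study window.
MIN_MINUTES, MIN_LESSONS = 60, 3

totals = usage.groupby("student_id").sum()
totals["compliant"] = (totals["session_minutes"] >= MIN_MINUTES) & \
                      (totals["lessons_completed"] >= MIN_LESSONS)
print(totals)  # students 101 and 103 meet both thresholds; 102 does not
```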

Some researchers reject comparison studies in which identification of the treatment group occurs after the product implementation has begun. This is based in part on the concern that the subset of users who comply with the suggested amount of usage will get more exposure to the program, and more exposure will result in a larger effect. This assumes, of course, that the product is effective; otherwise, the students and teachers will have been wasting their time and will likely perform worse than the comparison group.

There is also the concern that the “compliers” may differ from the non-compliers (and non-users) in some characteristic that isn’t measured, and that even after controlling for measurable variables (prior achievement, ethnicity, English proficiency, etc.), there could be a personal characteristic that results in an otherwise ineffective program becoming effective for them. We reject this concern and take the position that a product’s effectiveness can be strengthened or weakened by many factors. A researcher conducting any matched comparison study can never be certain that there isn’t an unmeasured variable that is biasing the result. (That’s why the What Works Clearinghouse only accepts quasi-experiments “with reservations.”) However, we believe that as long as the QE controls for the major factors that are known to affect outcomes, the study can meet the Every Student Succeeds Act requirement that the researcher “controls for selection bias.”
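
One common way to control for the major measured factors in such a QE is to match compliant users to comparison students on a propensity score estimated from those factors. The sketch below illustrates that general approach only; it is not a description of any particular study’s method, and the file and column names (students.csv, prior_score, frl, ell, compliant_user, post_score) are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Hypothetical student-level file and column names, for illustration only.
df = pd.read_csv("students.csv")   # columns: prior_score, frl, ell, compliant_user, post_score
covariates = ["prior_score", "frl", "ell"]

# Propensity of being a compliant user, given the measured covariates.
ps = LogisticRegression(max_iter=1000).fit(df[covariates], df["compliant_user"])
df["pscore"] = ps.predict_proba(df[covariates])[:, 1]

treated = df[df["compliant_user"] == 1]
controls = df[df["compliant_user"] == 0]

# 1:1 nearest-neighbor match on the propensity score (with replacement).
nn = NearestNeighbors(n_neighbors=1).fit(controls[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched = controls.iloc[idx.ravel()]

effect = treated["post_score"].mean() - matched["post_score"].mean()
print(f"matched-comparison estimate: {effect:.2f}")
```

In practice, a matching like this would also be checked for balance on each covariate and, per the point above, interpreted with the caveat that unmeasured variables may remain.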

With those caveats, we believe that a QE that identifies users by their compliance with a pre-specified level of usage is a good design. Studies that look at the measurable variables that modify the effectiveness of a product are not only useful for schools in answering their question, “Is the product likely to work in my school?” but also point the developer and product marketer to ways the product can be improved.

2018-07-27

Presenting at AERA 2017

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in San Antonio, TX from April 27 – 30, 2017.

Research Presentations will include the following.

Increasing Accessibility of Professional Development (PD): Evaluation of an Online PD for High School Science Teachers
Authors: Adam Schellinger, Andrew P Jaciw, Jenna Lynn Zacamy, Megan Toby, & Li Lin
In Event: Promoting and Measuring STEM Learning
Saturday, April 29 10:35am to 12:05pm
Henry B. Gonzalez Convention Center, River Level, Room 7C

Abstract: This study examines the impact of an online teacher professional development, focused on academic literacy in high school science classes. A one-year randomized control trial measured the impact of Internet-Based Reading Apprenticeship Improving Science Education (iRAISE) on instructional practices and student literacy achievement in 27 schools in Michigan and Pennsylvania. Researchers found a differential impact of iRAISE favoring students with lower incoming achievement (although there was no overall impact of iRAISE on student achievement). Additionally, there were positive impacts on several instructional practices. These findings are consistent with the specific goals of iRAISE: to provide high-quality, accessible online training that improves science teaching. Authors compare these results to previous evaluations of the same intervention delivered through a face-to-face format.


How Teacher Practices Illuminate Differences in Program Impact in Biology and Humanities Classrooms
Authors: Denis Newman, Val Lazarev, Andrew P Jaciw, & Li Lin
In Event: Poster Session 5 - Program Evaluation With a Purpose: Creating Equal Opportunities for Learning in Schools
Friday, April 28 12:25 to 1:55pm
Henry B. Gonzalez Convention Center, Street Level, Stars at Night Ballroom 4

Abstract: This paper reports research to explain the positive impact in a major RCT for students in the classrooms of a subgroup of teachers. Our goal was to understand why there was an impact for science teachers but not for teachers of humanities, i.e., history and English. We have labelled our analysis “moderated mediation” because we start with the finding that the program’s success was moderated by the subject taught by the teacher and then go on to look at the differences in mediation processes depending on the subject being taught. We find that the program’s impacts on teacher practices differ by mediator (as measured in surveys and observations) and that mediators are differentially associated with student impact based on context.


Are Large-Scale Randomized Controlled Trials Useful for Understanding the Process of Scaling Up?
Authors: Denis Newman, Val Lazarev, Jenna Lynn Zacamy, & Li Lin
In Event: Poster Session 3 - Applied Research in School: Education Policy and School Context
Thursday, April 27 4:05 to 5:35pm
Henry B. Gonzalez Convention Center, Ballroom Level, Hemisfair Ballroom 2

Abstract: This paper reports on a large-scale program evaluation that included an RCT and a parallel study of 167 schools outside the RCT, which provided an opportunity to study the growth of a program and to compare the two contexts. Teachers in both contexts were surveyed, and a large subset of the questions was asked of both scale-up teachers and teachers in the treatment schools of the RCT. We find large differences in the level of commitment to program success in the school; far less was found in the RCT, suggesting that a large-scale RCT may not be capturing the processes at play in the scale-up of a program.

We look forward to seeing you at our sessions to discuss our research. You can also view our presentation schedule here.

2017-04-17

Empirical Presents at AERA 2012

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in Vancouver, Canada from April 13 – 17, 2012. Our presentations will span two divisions: 1) Measurement and Research Methodology and 2) Research, Evaluation and Assessment in Schools.

Research Topics will include:

Current Studies in Program Evaluation to Improve Student Achievement Outcomes

Evaluating Alabama’s Math, Science and Technology Initiative: Results of a Three-Year, State-Wide Randomized Experiment

Accommodating Data From Quasi-Experimental Design

Quantitative Approaches to the Evaluation of Literacy Programs and Instruction for Elementary and Secondary Students

We look forward to seeing you at our sessions to discuss our research. You can also download our presentation schedule here. As has become tradition, we plan to host yet another of our popular AERA receptions. Details about the reception will follow in the months to come.

2011-11-18

Quasi-experimental Design Used to Build Evidence for Adolescent Reading Intervention

A study of Jamestown Reading Navigator (JRN) from McGraw-Hill (now posted on our reports page), conducted in Miami-Dade County Public Schools, found positive results on the Florida state reading test (FCAT) for high school students in their intensive reading classes. JRN is an online application whose internal record keeping makes it possible to identify the treatment group for a comparison design. The district provided the full student, teacher, and roster data for 9th and 10th grade intensive reading classes, while JRN, as an online application, provided the identification of the student and teacher users through its computer logs. The quasi-experimental design was strengthened by using schools with both JRN and non-JRN students. Of the 70 schools that had JRN logs, 23 had both JRN and non-JRN intensive reading classes and sufficient data for analysis.

Download the 2010 report here.

2011-04-15

Report Released on Phase Two of The Efficacy of PCI’s Reading Program

The results are in for Phase Two of a five-year longitudinal efficacy trial of PCI’s Reading Program for students with moderate to severe disabilities. This research builds upon an initial randomized control trial conducted last year, which found that students in the PCI program had substantial success in learning sight words in comparison to students in the control group. Phase Two continues the research in the Brevard and Miami-Dade County school districts with teachers of supported-level students in grades 3-8. Using both quasi-experimental and extra-experimental methods, researchers again found that students who received PCI for two years achieved significantly higher scores on the sight word assessment than students who were not exposed to the program. However, student progress through the program was slower than the developers initially expected. Empirical will continue to collect, integrate, and analyze outcomes for three more years.

The methodological designs for this study were presented at this year’s annual SREE conference in Washington, D.C. Results for this study will also be presented at the 2010 Annual AERA Meeting in Denver, CO. Meet the research team as they describe the study in further detail during the Division C poster session on May 3.

2010-04-14

Empirical Education Partners with NWEA to Research Virtual Control Groups

Northwest Evaluation Association, the leading provider of computer adaptive testing for schools, is partnering with Empirical Education to analyze the properties of its virtual control group (VCG) technologies. Empirical has already conducted a large number of randomized experiments in which NWEA’s “Measures of Academic Progress” (MAP) served both as pretest and posttest. The characteristics of a randomly assigned control group provide a yardstick in evaluating the characteristics of the VCG. The proposed research builds on extensive theoretical work on approaches to forming comparison groups for obtaining unbiased impact estimates from quasi-experiments.

In parallel to this theoretical analysis, NWEA and Empirical Education are cooperating in a nationwide comparison group (“quasi-”) experiment to estimate the impact of a basal reading program in wide use nationally. Taking advantage of the fact that MAP is in use in thousands of schools, Empirical will identify a group of schools currently using this reading program and testing their students’ reading with MAP, and then select a well-matched comparison group from non-users who also test with MAP. Characteristics of the schools, such as SES, percent English learners, urbanicity, ethnicity, and geographic region, as well as prior reading achievement, will be used in identifying the comparison group.
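
A rough sketch of how such school-level matching might be set up is below. The file and column names are assumptions for illustration, not NWEA’s or Empirical’s actual variables, and the exact-match-then-nearest-neighbor rule is just one reasonable way to combine categorical and continuous characteristics.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical school-level file and column names, for illustration only.
schools = pd.read_csv("schools_map.csv")
# expected columns: uses_program, region, urbanicity, pct_frl, pct_ell, prior_map_reading

cont = ["pct_frl", "pct_ell", "prior_map_reading"]                 # continuous characteristics
z = (schools[cont] - schools[cont].mean()) / schools[cont].std()   # standardized

users = schools[schools["uses_program"] == 1]
pool = schools[schools["uses_program"] == 0]

matches = []
for i, school in users.iterrows():
    # Require exact agreement on the categorical characteristics...
    candidates = pool[(pool["region"] == school["region"]) &
                      (pool["urbanicity"] == school["urbanicity"])]
    if candidates.empty:
        continue  # no acceptable comparison school for this user
    # ...then take the nearest candidate on the standardized continuous measures.
    d = cdist(z.loc[[i]], z.loc[candidates.index])
    matches.append((i, candidates.index[d.argmin()]))

print(matches)  # (user school, matched comparison school) index pairs
```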

2009-03-09

Empirical Education Focuses on Local Characteristics at the 14th Annual CREATE Conference

Empirical Education staff presented at the National Evaluation Institute’s (NEI) 14th annual CREATE conference in Wilmington, North Carolina. Both presentations focused on the local characteristics of the evaluations. Dr. Denis Newman, president of Empirical Education, and Jenna Zacamy, research manager, presented a randomized experiment that evaluated the impact of a pre-algebra curriculum (Carnegie Learning’s Cognitive Tutor Bridge to Algebra) being introduced in a pilot program in the Maui School District. The district adopted the program based in part on previous research showing substantial positive results in Oklahoma (Morgan & Ritter 2002). Given the unique locale and ethnic makeup in Maui, a local evaluation was warranted. District educators were concerned in particular with their less experienced teachers and with ethnic groups considered at risk. Unlike in prior research, we found no overall impact, although for the algebraic operations subscale, low-scoring students benefited from being in the Cognitive Tutor classes, indicating that the new program could help to reduce the achievement gaps of concern. We also found, for the overall math scale, that uncertified teachers were more successful with their Cognitive Tutor classes than with their conventional classes.

Dr. Newman also presented work co-authored with Marco Muñoz and Andrew Jaciw on a quasi-experimental comparison, conducted by Empirical Education and Jefferson County (KY) schools, of an activity-based middle-school science program (Premier Science) to more traditional textbook programs. All the data were supplied by the district, including a rating of quality of implementation. The primary pretest and outcome measures were tests of science and reading achievement. While there was no discernible difference overall, poor readers gained more from the non-textbook approach, helping to diminish an achievement gap of concern to the district.

2008-12-15

Five Presentations Accepted for AERA 2009

Empirical Education will be heading to sunny San Diego next April! Once again, Empirical will have a strong showing at the 2009 American Educational Research Association conference, which will be held in downtown San Diego on April 13-17, 2009. Our presentations will span several divisions, including Learning & Instruction, Measurement & Research Methodology, and Research, Evaluation, & Assessment in Schools. Research topics will include:

As a follow-up to our successful 2008 AERA New York reception at Henri Bendel’s Chocolate Room, Empirical Education plans to host another “meet and greet” at this year’s conference as well. Details about the reception will be announced on our website soon.

2008-12-01