blog posts and news stories

Validating Research that Helps Reduce Achievement Gaps

This is the third of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how issues of bias affecting the NCLB-era average impact estimates are not necessarily inherited by the differential subgroup estimates. (Read the first and the second parts of the story here.)

There’s a joke among researchers: “When a study is conducted in Brooklyn and in San Diego the average results should apply to Wichita.”

Following the NCLB-era rules, researchers usually put all their resources for a study into the primary result, providing a report on the average effectiveness of a product or tool across all populations. This practice disregards subgroup differences and misses the opportunity to discover that the program studied may work better or worse for certain populations. A particularly strong example comes from our own work, where this philosophy led to a misleading conclusion. While the program we studied was celebrated as working on average, it turned out not to help the Black kids. It widened an existing achievement gap. Our point is that differential impacts are not just extras; they should be the essential results of school research.

In many cases we find that a program works well for some students and not so much for others. In all cases the question is whether the program increases or decreases an existing gap. Researchers call this differential impact an interaction between the characteristics of the people in the study and the program, or they call it a moderated impact, as in: the program effect is moderated by the characteristic. If the goal is to narrow an achievement gap, the difference in impact between subgroups (for example: English language learners, kids in free lunch programs, or girls versus boys) provides the most useful information.
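As a minimal sketch of this arithmetic, assume a hypothetical study that reports mean outcome scores for program users and non-users in two subgroups. Every number below, and the names `means` and `impact`, are invented for illustration and do not come from any study discussed here.

```python
# Hypothetical mean test scores; every number here is made up for illustration.
means = {
    "subgroup_a": {"users": 74.0, "non_users": 70.0},
    "subgroup_b": {"users": 61.0, "non_users": 60.0},
}

def impact(group):
    """Program impact within one subgroup: users' mean minus non-users' mean."""
    return means[group]["users"] - means[group]["non_users"]

impact_a = impact("subgroup_a")  # 4.0
impact_b = impact("subgroup_b")  # 1.0

# Average impact, assuming (for simplicity) equally sized subgroups.
average_impact = (impact_a + impact_b) / 2  # 2.5

# Differential (moderated) impact: how much more the program helps A than B.
differential_impact = impact_a - impact_b  # 3.0

# The achievement gap between the subgroups, without and with the program.
gap_without = means["subgroup_a"]["non_users"] - means["subgroup_b"]["non_users"]  # 10.0
gap_with = means["subgroup_a"]["users"] - means["subgroup_b"]["users"]  # 13.0

print("average impact:", average_impact)
print("differential impact:", differential_impact)
print("change in gap:", gap_with - gap_without)
```

In this invented case the program "works on average" (2.5 points), yet the gap between the subgroups grows by exactly the differential impact (3 points), which is the number a district trying to narrow the gap would need to see.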

Examining differential impacts across subgroups also turns out to be less subject to the kinds of bias that have concerned NCLB-era researchers. In a recent paper, Andrew Jaciw showed that with matched comparison designs, estimation of the differential effect of a program on contrasting subgroups of individuals can be less susceptible to bias than the researcher’s estimate of the average effect. Moderator effects are less prone to certain forms of selection bias. In this work, he develops the idea that when evaluating differential effects using matched comparison studies that involve cross-site comparisons, standard selection bias is “differenced away” or negated. While a different form of bias may be introduced, he shows empirically that it is considerably smaller. This is a compelling and surprising result and speaks to the importance of making moderator effects a much greater part of impact evaluations. Jaciw finds that the differential effects for subgroups do not necessarily inherit the biases that are found in average effects, which were the core focus of the NCLB era.
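The cancellation idea can be illustrated with a small worked example. This is a simplified sketch and not Jaciw's derivation: it assumes, hypothetically, that selection bias adds to each subgroup's estimate as a large part shared by both subgroups plus a smaller subgroup-specific part. All numbers and variable names are invented.

```python
# Hypothetical true program effects in two subgroups (invented numbers).
true_effect = {"subgroup_a": 4.0, "subgroup_b": 1.0}

# Assumed selection bias in a matched comparison: a part common to both
# subgroups, plus a smaller part that differs by subgroup.
common_bias = 2.0
specific_bias = {"subgroup_a": 0.5, "subgroup_b": 0.25}

# What the matched comparison would estimate under this additive assumption.
estimated = {
    g: true_effect[g] + common_bias + specific_bias[g] for g in true_effect
}

# Bias in the average effect (equal-sized subgroups assumed).
true_average = sum(true_effect.values()) / 2
estimated_average = sum(estimated.values()) / 2
bias_in_average = estimated_average - true_average  # 2.375

# Bias in the differential effect: the common part cancels in the subtraction.
true_difference = true_effect["subgroup_a"] - true_effect["subgroup_b"]
estimated_difference = estimated["subgroup_a"] - estimated["subgroup_b"]
bias_in_difference = estimated_difference - true_difference  # 0.25

print("bias in average effect:", bias_in_average)
print("bias in differential effect:", bias_in_difference)
```

Under these assumptions the shared bias drops out of the subgroup difference, and only the difference between the subgroup-specific parts remains. Whether that remainder is small in practice is an empirical question, which is what the paper examines.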

Therefore, matched comparison studies may be less biased than one might think for certain important quantities. On the other hand, RCTs, which are often believed to be without bias (thus, “gold standard”), may be biased in ways that are often overlooked. For instance, results from RCTs may be limited by selection based on who chooses to participate. The teachers and schools who agree to be part of the RCT might bias results in favor of those more willing to take risks and try new things. In that case, the results wouldn’t generalize to less adventurous teachers and schools.

A general advantage that rapid-cycle evaluations (RCEs), in the form of matched comparison experiments, have over RCTs is that they can be conducted under more true-to-life circumstances. When existing data are used, outcomes reflect results from field implementations as they happened in real life. Such RCEs can be performed more quickly and at a lower cost than RCTs. They can be used by school districts that have paid for a pilot implementation of a product and want to know in June whether the program should be expanded in September. The key to this kind of quick turn-around, rapid-cycle evaluation is to use data from the just-completed school year rather than following the NCLB-era habit of identifying schools that have never implemented the program and assigning teachers as users and non-users before implementation begins. Tools, such as the RCE Coach (now the Evidence to Insights Coach) and Evidentally’s Evidence Suite, are being developed to support district-driven as well as developer-driven matched comparisons.

Commercial publishers have also come under criticism for potential bias. The Hechinger Report recently published an article entitled “Ed tech companies promise results, but their claims are often based on shoddy research.” Also from the Hechinger Report, Jill Barshay offered a critique entitled “The dark side of education research: widespread bias.” She cites a working paper that was recently published in a well-regarded journal by a Johns Hopkins team led by Betsy Wolf, who worked with reports of studies within the WWC database (all using either RCTs or matched comparisons). Wolf compared differences in results where some studies were paid for by the program developer and others paid for through independent sources (such as IES grants to research organizations). Wolf’s study found that effect sizes were substantially larger in developer-sponsored studies than in independently funded ones. The most likely explanation was that developers are more prone to avoid publishing unfavorable results. This is called a “developer effect.” While we don’t doubt that Wolf found a real difference, the interpretation and importance of the bias can be questioned.

First, while more selective reporting by developers may bias their reporting upward, other biases may lead to smaller effects for independently funded research. Following NCLB-era rules, independently funded researchers must convince school districts to use the new materials or programs to be studied. But many developer-driven studies are conducted where the materials or program being studied is already in use (or being considered for use and established as a good fit for adoption). A lack of inherent interest among recruited districts could bias the average effect estimate downward.

When a developer is budgeting for an evaluation, working in a district that has already invested in the program and succeeded in its implementation is often the best way to provide the information that other districts need, since it shows not just the outcomes but a case study of an implementation. Results from a district that has chosen a program for a pilot and succeeded in its implementation may not be an unfair bias. While the developer-selected districts with successful pilots may score higher than districts recruited by an independently funded researcher, they are also more likely to have commonalities with districts interested in adopting the program. Recruiting schools with no experience with the program may bias the results to be lower than they should be.

Second, the fact that bias was found in the standard NCLB-era average between the user and non-user groups provides another reason to drop the primacy of the overall average and put our focus on the subgroup moderator analysis, where there may be less bias. Average outcomes across all populations have little information value for school district decision-makers. Moderator effects are what they need if their goal is to reduce rather than enlarge an achievement gap.

We have no reason to assume that the information school decision-makers need has inherited the same biases that have been demonstrated in the use of developer-driven Tier 2 and 3 studies in evaluation of programs. We do see that the NCLB-era habit of ignoring subgroup differences reinforces the status quo and hides achievement gaps in K-12 schools.

Next, we advocate replacing the NCLB-era single study with meta-analyses of many studies where the focus is on the moderators.

2020-06-26

ESSA’s Evidence Tiers and Potential for Bias

This is the second of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we explain how ESSA introduced flexibility and how NCLB-era habits have raised issues about bias. (Read the first one here.)

The tiers of evidence defined in the Every Student Succeeds Act (ESSA) give schools and researchers greater flexibility, but are not without controversy. Flexibility creates the opportunity for biased results. The ESSA law, for example, states that studies must statistically control for “selection bias”, recognizing that teachers who “select” to use a program may have other characteristics that give them an advantage and the results for those teachers could be biased upward. As we trace the problem of bias it is useful to go back to the interpretation of ESSA that originated with the NCLB-era approach to research.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based educational products that automatically report usage data, it is important to clarify both ESSA’s useful advances and how the four tiers fail to address a critical scientific concept needed for schools to make use of research.

We’ve written elsewhere how the ESSA tiers of evidence form a developmental scale. The four tiers give educators as well as developers of educational materials and products an easier way to start examining effectiveness without making the commitment to the type of scientifically-based research that NCLB once required.

We think of the four tiers of evidence defined in ESSA as a pyramid as shown in this figure.

[Figure: ESSA levels of evidence pyramid]

  1. RCT. At the apex is Tier 1, defined by ESSA as a randomized control trial (RCT), considered the gold standard in the NCLB era.
  2. Matched Comparison or “quasi-experiments”. With Tier 2 the WWC also allowed for less rigorous experimental research designs, such as matched comparisons or quasi-experiments (QE) where schools, teachers, and students (experimental units) independently chose to engage in the program. QEs are permitted but accepted “with reservations” because without random assignment there is the possibility of “selection bias.” For example, teachers who do well at preparing kids for tests might be more likely to participate in a new program than teachers who don’t excel at test preparation. With an RCT we can expect that such positive traits are equally distributed in the experiment between users and non-users.
  3. Correlational. Tier 3 is an important and useful addition to evidence, as a weaker but readily achieved method once the developer has a product running in schools. At that point, they have an opportunity to see if critical elements of the program correlate with outcomes of interest. This provides promising evidence, which is useful for both improving the product and giving the schools some indication that it is helping. This evidence suggests that it might be worthwhile to follow up with a Tier 2 study for more definitive results.
  4. Rationale. The base level or Tier 4 is the expectation that any product should have a rationale based on learning science for why it is likely to work. Schools will want this basic rationale for why a program should work before trying it out. Our colleagues at Digital Promise have announced a service in which developers are certified as meeting Tier 4 standards.

Each subsequent tier of evidence (from number 4 to 1) improves what’s considered the “rigor” of the research design. It is important to understand that the hierarchy has nothing to do with whether the results can be generalized from the setting of the study to the district where the decision-maker resides.

While the NCLB-era focus on strong design puts emphasis on the Tier 1 RCT, we see Tiers 2 and 3 as an opportunity for lower-cost and faster-turn-around “rapid-cycle evaluations” (RCEs). Tier 1 RCTs have given education research a well-deserved reputation as slow and expensive. It can take one to two years to complete an RCT, with additional time needed for data collection, analysis, and reporting. This extensive work also includes recruiting districts that are willing to participate in the RCT and often puts the cost of the study in the millions of dollars. We have conducted dozens of RCTs following the NCLB-era rules, but advocate less expensive studies in order to get the volume of evidence schools need. In contrast to an RCT, an RCE that uses existing data from a school system can be both faster and far less expensive.

There is some controversy about whether schools should use lower-tier evidence, which might be subject to “selection bias.” Randomized control trials are protected from selection bias since users and non-users are assigned randomly, whether they like it or not. It is well known, and has been recently pointed out by Robert Slavin, that using a matched comparison, a study where teachers chose to participate in the pilot of a product, can result in unmeasured variables, technically “confounders,” that affect outcomes. These variables are associated with the qualities that motivate a teacher to pursue pilot studies and their desire to excel in teaching. The comparison group may lack these characteristics that help the self-selected program users succeed. Studies of Tiers 2 and 3 will always have, by definition, unmeasured variables that may act as confounders.

While selection bias is obviously a concern, researchers can statistically control for important characteristics associated with choosing to use a program. For example, a teacher’s motivation to use edtech products can be controlled for by collecting information from the prior year on the amount of usage by the teacher and students of a full set of products. Past studies looking at the conditions under which there is correspondence between results of RCTs and matched comparison studies that evaluate the impact of a given program have established that it is exactly “focal” variables such as motivation that are influential confounders. Controlling for a teacher’s demonstrated motivation and students’ incoming achievement may go very far in adjusting away bias. We suggest this in a design memo for a study now being undertaken. This statistical control meets the ESSA requirement for Tiers 2 and 3.
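To make the idea of statistical control concrete, here is a minimal sketch that uses stratification, one simple form of adjustment, rather than whatever model a particular study might actually use. It assumes a hypothetical data set in which each teacher has a prior-year usage level ("high" or "low") standing in for motivation. The data, the function names, and the two-level usage measure are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (is_program_user, prior_year_usage, outcome_score).
# Motivated ("high" prior usage) teachers are over-represented among users.
teachers = [
    (True, "high", 82.0), (True, "high", 84.0), (True, "high", 86.0),
    (False, "high", 82.0),
    (True, "low", 72.0),
    (False, "low", 68.0), (False, "low", 70.0), (False, "low", 72.0),
]

def naive_difference(rows):
    """Users' mean minus non-users' mean, ignoring prior usage."""
    users = [score for is_user, _, score in rows if is_user]
    non_users = [score for is_user, _, score in rows if not is_user]
    return mean(users) - mean(non_users)

def stratified_difference(rows):
    """Compare users and non-users within each prior-usage stratum, then
    average the within-stratum differences weighted by stratum size.
    Assumes every stratum contains at least one user and one non-user."""
    strata = defaultdict(lambda: {True: [], False: []})
    for is_user, prior_usage, score in rows:
        strata[prior_usage][is_user].append(score)
    total = len(rows)
    estimate = 0.0
    for groups in strata.values():
        size = len(groups[True]) + len(groups[False])
        estimate += (size / total) * (mean(groups[True]) - mean(groups[False]))
    return estimate

print("naive difference:", naive_difference(teachers))            # 8.0
print("stratified difference:", stratified_difference(teachers))  # 2.0
```

In this invented data the naive comparison suggests an 8-point advantage, but most of it reflects who chose to use the program. Comparing like with like on prior usage leaves a 2-point difference. Any confounder that is not measured, of course, cannot be adjusted for this way.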

We have a more radical proposal for controlling all kinds of bias that we address in the next posting in this series.

2020-06-22

Ending a Two-Decade-Old Research Legacy

This is the first of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. This post focuses on research methods established in the last 20 years and the reason that much of that research hasn’t been useful to schools. Subsequent posts will present an approach that works for schools.

With the COVID-19 crisis and school closures in full swing, use of edtech is predicted to not just continue but expand when the classrooms it has replaced come back into use. There is little information available about which edtech products work for whom and under what conditions, and the lack of evidence has been noted. As a research organization specializing in rigorous program evaluations, Empirical Education, working with Evidentally, Inc., has been developing methods for providing schools with useful and affordable information.

The “modern” era of education research was established with No Child Left Behind, the education law passed in 2002, which established the Institute of Education Sciences (IES). The declaration of “scientifically-based research” in NCLB sparked a major undertaking assigned to the IES. To meet NCLB’s goals for improvement, it was essential to overthrow education research methods that lacked the practical goal of determining whether programs, products, practices, or policies moved the needle for an outcome of interest. These outcomes could be test scores or other measures, such as discipline referrals, that the schools were concerned with.

The kinds of questions that the learning sciences of the time answered were different: for example, how can we compare tests with tasks outside of the test? Or what is the structure of the classroom dialogue? Instead of these qualitative methods, IES looked to the medical field and other areas of policy like workforce training where randomization was used to assign subjects to a treatment group and a control group. Statistical summaries were then used to decide whether the researcher could reasonably conclude that a difference of the magnitude observed was big enough that it was unlikely to happen without a real effect.

In an attempt to mimic the medical field, IES set up the What Works Clearinghouse (WWC), which became the arbiter of acceptable research. WWC focused on getting the research design right. Many studies would be disqualified, with very few, and sometimes just one, meeting design requirements considered acceptable to provide valid causal evidence. This led to the idea that each program, product, practice, or policy needed at least one good study that would prove its efficacy. The focus on design (or “internal validity”) was to the exclusion of generalizability (or “external validity”). Our team has conducted dozens of RCTs and appreciates the need for careful design. But we try to keep in mind the information that schools need.

Schools need to know whether the current version of the product will work when implemented using district resources and with its specific student and teacher populations. A single acceptable study may have been conducted a decade ago with ethnicities different from the district that is looking for evidence of what will work for them. The WWC has worked hard to be fair to each intervention, especially where there are multiple studies that meet their standards. But the key is that each separate study comes with an average conclusion. While the WWC notes the demographic composition of the subjects in the study, differential results for subgroups, when tested by the study, are considered secondary.

The Every Student Succeeds Act (ESSA) was passed with bi-partisan support in late 2015 and replaced NCLB. ESSA replaced scientifically-based research required by NCLB with four evidence tiers. While this was an important advance, ESSA retained the WWC as providing the definition of the top two tiers. The WWC, which remains the arbiter of evidence validity, gives the following explanation of the purpose of the evidence tiers.

“Evidence requirements under the Every Student Succeeds Act (ESSA) are designed to ensure that states, districts, and schools can identify programs, practices, products, and policies that work across various populations.”

The key to ESSA’s failings is in the final clause: “that work across various populations.” As a legacy of the NCLB era, the WWC is only interested in the average impact across the various populations in the study. The problem is that district or school decision-makers need to know if the program will work in their schools given their specific population and resources.

The good news is that the education research community is recognizing that ignoring the population, region, product, and implementation differences is no longer necessary. Strict rules were needed to make the paradigm shift to the RCT-oriented approach. But now that NCLB’s paradigm has been in place for almost two decades, and generations of educational researchers have been trained in it, we can broaden the outlook. Mark Schneider, IES’s current director, defines the IES mission as being “in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results. Researchers are looking at how to generalize results; for example, the Generalizer tool developed by Tipton and colleagues uses demographics of a target district to generate an applicability metric. The Jefferson Education Exchange’s EdTech Genome Project has focused on implementation models as an important factor in efficacy.

Our methods move away from the legacy of the last 20 years to lower the cost of program evaluations, while retaining the scientific rigor and avoiding biases that give schools misleading information. Lowering cost makes it feasible for thousands of small local studies to be conducted on the multitude of school products. Instead of one or a small handful of studies for each product, we can use a dozen small studies encompassing the variety of contexts so that we can determine for whom and under what conditions the product works.
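One way to picture how several small studies could be combined is fixed-effect inverse-variance pooling, a standard meta-analytic technique. It is shown here only as an illustration and is not presented as the method used in our work. The three studies, their differential-impact estimates, and their standard errors are hypothetical.

```python
import math

# (study label, estimated differential impact, standard error): all hypothetical.
studies = [
    ("district_1", 3.0, 1.0),
    ("district_2", 1.5, 0.5),
    ("district_3", 0.0, 1.0),
]

def pool_fixed_effect(rows):
    """Inverse-variance weighted average of the study estimates.
    More precise studies (smaller standard errors) get more weight."""
    weights = [1.0 / (se ** 2) for _, _, se in rows]
    estimates = [estimate for _, estimate, _ in rows]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

pooled, pooled_se = pool_fixed_effect(studies)
print("pooled differential impact:", pooled)            # 1.5
print("pooled standard error:", round(pooled_se, 3))    # 0.408
```

The pooled standard error (about 0.41) is smaller than that of any single study, which is the sense in which many inexpensive local studies can add up to a precise answer about for whom a product works. A fixed-effect model assumes the studies share one underlying effect; when contexts differ, a random-effects model or pooling separately by context may be more appropriate.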

Read the second part of this series next.

2020-06-02