The rebellion against the old NCLB way of doing efficacy research is gaining force. A growing community among edtech developers, funders, researchers, and school users has been meeting in an attempt to reach a consensus on an alternative built on ESSA.
IES’s openness in the directions it is currently pursuing is helping this effort. In fact, we are moving into a new phase marked by two-way communication with the regime. While the rebellion hasn’t yet handed over its lightsabers, it is encouraged by the level of interest from prominent researchers.
From these ongoing discussions, there have been some radical suggestions inching toward consensus. A basic idea now being questioned is this:
The difference between the average of the treatment group and the average of the control group is a valid measure of effectiveness.
There are two problems with this:
- In schools, there’s no “placebo,” that is, nothing that looks like a useful program but is known to have zero effectiveness. Whatever is going on in the schools, or classes, or with teachers and students in the control condition has some usefulness or effectiveness. The activities in the control classes or schools may be more useful than the activities being evaluated in the study, or less useful. The study may find that the “effectiveness” of the activities being studied is positive, negative, or too small to be discerned statistically by the study. In any case, the size (negative or positive) of the effect is determined as much by what’s being done in the control group as by what’s being done in the treatment group.
- Few educational activities have the same level of usefulness for all teachers and students. Looking only at the average will obscure the differences. For example, we ran a very large study for the U.S. Department of Education of a STEM program where we found that, on average, the program was effective. What the department didn’t report was that it worked only for the white kids, not the black kids. The program widened rather than reduced the existing achievement gap. If you are considering adopting this STEM program, the impact on the different subgroups is relevant: a high-minority school district may want to avoid it. Also, to make the program better, the developers need to know where it works and where it doesn’t. Again, the average impact is not just meaningless but can also be misleading.
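Both problems can be seen in a few lines of arithmetic. The sketch below uses invented scores (the subgroup labels "A" and "B", the numbers, and the helper `mean_difference` are all hypothetical, not data from any study mentioned here) to show how a positive average difference can hide a subgroup with no benefit, and how the same treatment scores produce a different "effect" when the control group is doing something more useful.

```python
from statistics import mean

# Hypothetical records, invented for illustration only:
# (subgroup, condition, test score)
records = [
    ("A", "treatment", 78), ("A", "treatment", 82),
    ("A", "control", 70), ("A", "control", 72),
    ("B", "treatment", 60), ("B", "treatment", 62),
    ("B", "control", 61), ("B", "control", 63),
]

def mean_difference(rows):
    """Treatment mean minus control mean for the given rows."""
    treated = [score for _, cond, score in rows if cond == "treatment"]
    control = [score for _, cond, score in rows if cond == "control"]
    return mean(treated) - mean(control)

# The single number a study would typically report.
overall = mean_difference(records)

# The same calculation, done separately for each subgroup.
by_subgroup = {
    group: mean_difference([r for r in records if r[0] == group])
    for group in sorted({r[0] for r in records})
}

# Same treatment scores, but assume the control classes were doing
# something 5 points more useful (again, a made-up number).
stronger_control = [
    (g, c, s + 5 if c == "control" else s) for g, c, s in records
]

print("overall difference:", overall)
print("difference by subgroup:", by_subgroup)
print("overall difference, stronger control:", mean_difference(stronger_control))
```

With these made-up numbers the overall difference is positive while subgroup B sees a slightly negative one, and raising the control scores flips the sign of the overall result even though nothing about the program changed.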
A solution to the overuse of the average difference from studies is to conduct a lot more studies. The price ED paid for our large study could have paid for 30 studies of the kind we are now conducting, in the same state and of the same program, in 10% of the time of the original study. If we had 10 different studies for each program, where studies are conducted in different school districts with different populations and levels of resources, the “average” across these studies starts to make sense. Importantly, the average across these 10 studies for each of the subgroups will give a valid picture of where, how, and with which students and teachers the program tends to work best. This kind of averaging used in research is called meta-analysis and allows many small differences found across studies to build on the power of each study to generate reliable findings.
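To make the pooling idea concrete, here is a minimal sketch of one common form of meta-analysis, fixed-effect inverse-variance weighting. The study estimates and standard errors are invented, and `pool_fixed_effect` is a hypothetical helper name. A fixed-effect model is the simplest case; when districts genuinely differ, a random-effects model would likely be a better fit.

```python
from math import sqrt

def pool_fixed_effect(estimates):
    """Pool (effect, standard_error) pairs with inverse-variance weights.

    Returns (pooled_effect, pooled_standard_error).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)

# Hypothetical small studies of one program in four districts:
# (estimated effect, standard error). Invented numbers.
studies = [(0.30, 0.20), (0.10, 0.20), (0.20, 0.20), (0.25, 0.20)]

pooled_effect, pooled_se = pool_fixed_effect(studies)

# A study is conventionally called "significant" at the 5% level
# when |effect| / standard_error exceeds about 1.96.
significant_alone = [abs(effect) / se > 1.96 for effect, se in studies]
significant_pooled = abs(pooled_effect) / pooled_se > 1.96

print("pooled effect:", round(pooled_effect, 4), "SE:", round(pooled_se, 4))
print("any single study significant:", any(significant_alone))
print("pooled estimate significant:", significant_pooled)
```

In this invented example no single study can distinguish its effect from zero, but the pooled estimate can, because combining the studies halves the standard error. Running the same pooling separately for each subgroup would give the per-subgroup picture described above.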
If developers or publishers of the products being used in schools took advantage of their hundreds of implementations to gather data, and if schools were prepared to share student data for this research, we could have research findings that both help schools decide what will likely work for them and help developers improve their products.