blog posts and news stories

Going Beyond the NCLB-Era to Reduce Achievement Gaps

We just published on Medium an important article that traces the recent history of education research to show how an unfortunate legacy of NCLB has weakened research methods as applied to school use of edtech and has rendered the resulting achievement gaps invisible. The article was originally a set of four blog posts by CEO Denis Newman and Chief Scientist Andrew Jaciw. It shows how the legacy belief that differential subgroup effects (e.g., based on poverty, prior achievement, minority status, or English proficiency) found in experiments are, at best, a secondary exploration has left serious achievement gaps unexamined. It also shows how the false belief that only studies based on data collected before program implementation are free of misleading biases has given research a deserved reputation for being slow and costly. Instead, we present a rationale for low-cost, fast-turnaround studies that combine cloud-based edtech usage data with administrative data that school districts have already collected. Working in districts that have already implemented the program lowers the cost to the point that a dozen small studies, each examining subgroup effects (which Jaciw has shown to be relatively unbiased), can be combined to produce generalizable results. These results are what school decision-makers need in order to purchase edtech that works for all their students.

Read the article on Medium here.

Or read the four-part blog series we posted this past summer.

  1. Ending a Two-Decade Research Legacy

  2. ESSA Evidence Tiers and Potential for Bias

  3. Validating Research that Helps Reduce Achievement Gaps

  4. Putting Many Small Studies Together

2020-09-16

Putting Many Small Studies Together

This is the last of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how many small studies can give better evidence for resolving achievement gaps. To read the first three parts, use these links.

1. Ending a Two-Decade Research Legacy

2. ESSA Evidence Tiers and Potential for Bias

3. Validating Research that Helps Reduce Achievement Gaps

The NCLB era of the single big study should be giving way to the analysis of the differential impacts for subgroups from multiple studies. This is the information that schools need in order to reduce achievement gaps. Today’s technology landscape is ready for this major shift in the research paradigm. The school shutdowns resulting from the COVID-19 pandemic have demonstrated that the value of edtech products goes beyond the cost reduction of eliminating expensive print materials. Over the last decade, digital learning products have collected usage data that provides rich and systematic evidence of how products are being used and by whom. At the same time, schools have accumulated huge databases of digital records on demographics and achievement history, with public data at a granularity down to the grade level. Using today’s “big data” analytics, this wealth of information can be put to work for a radical reduction in the cost of showing efficacy.

Fast-turnaround, low-cost research will enable hundreds of studies to be conducted, providing school decision-makers with information that answers their questions. Their questions are not just “which program, on average, produces the largest effect?” Their questions are “which program is most likely to work in my district, with my kids and teachers, and with my available resources, and which are most likely to reduce the gaps of greatest concern?”

Meta-analysis is a method for combining multiple studies to increase generalizability (Shadish, Cook, & Campbell, 2002). With meta-analysis, we can test for stability of effects across sites and synthesize those results, where warranted, based on specific statistical criteria. While moderator analysis was considered merely exploratory in the NCLB era, meta-analysis allows moderator results from multiple small studies to be combined to confirm a differential impact. Meta-analysis, or other approaches to research synthesis, combined with big data presents new opportunities to move beyond the NCLB-era philosophy that prizes the single big study to prove the efficacy of a program.
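To illustrate the kind of synthesis we have in mind, here is a minimal sketch (not our production analysis) of an inverse-variance-weighted meta-analysis that pools the subgroup differential effect from several small studies. The district labels, estimates, and standard errors below are invented for illustration only.

    import math

    # Each entry: (study label, estimated subgroup differential effect, standard error)
    studies = [
        ("District A", 0.12, 0.08),
        ("District B", 0.20, 0.10),
        ("District C", 0.05, 0.09),
    ]

    # Fixed-effect pooling: weight each study by the inverse of its variance
    weights = [1 / se ** 2 for _, _, se in studies]
    pooled = sum(w * d for (_, d, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))

    print(f"Pooled differential effect: {pooled:.3f} (SE {pooled_se:.3f})")
    print(f"95% CI: [{pooled - 1.96 * pooled_se:.3f}, {pooled + 1.96 * pooled_se:.3f}]")

With enough qualifying studies, the same weighting logic extends to random-effects models, which also estimate how much the differential effect varies from site to site.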

While addressing WWC and ESSA standards, we caution that a single study in one school district, or even several studies in several school districts, may not provide enough useful information to generalize to other school districts. For research to be most effective, we need studies in enough districts to represent the full diversity of relevant populations. Studies also need to systematically include moderator analysis so that impacts can be generalized for subgroups.

The definitions provided in ESSA do not address how much information is needed to generalize from a particular study to implementation in other school districts. While we accept that well-designed Tier 2 or 3 studies are necessary to establish an appropriate level of rigor, we do not believe a single study is sufficient to declare that a program will be effective across varied populations. We note that the Standards for Excellence in Education Research (SEER), recently adopted by IES, call for facilitating generalizability.

After almost two decades of exclusive focus on the design of the single study, we need to more effectively address achievement gaps with the specifics that school decision-makers need. Lowering the cost and turnaround time of research studies that break out subgroup results is entirely feasible. With enough studies qualified for meta-analysis, a new wealth of information will be available to educators who want to select the products that will best serve their students. This new order will democratize learning across the country, reducing inequities and raising student achievement in K-12 schools.

2020-07-07

Validating Research that Helps Reduce Achievement Gaps

This is the third of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how issues of bias affecting the NCLB-era average impact estimates are not necessarily inherited by the differential subgroup estimates. (Read the first and the second parts of the story here.)

There’s a joke among researchers: “When a study is conducted in Brooklyn and in San Diego the average results should apply to Wichita.”

Following the NCLB-era rules, researchers usually put all their resources for a study into the primary result, providing a report on the average effectiveness of a product or tool across all populations. This leads to disregarding subgroup differences and misses the opportunity to discover that the program studied may work better or worse for certain populations. A particularly strong example comes from our own work, where this philosophy led to a misleading conclusion. While the program we studied was celebrated as working on average, it turned out not to help the Black kids; it widened an existing achievement gap. Our point is that differential impacts are not just extras; they should be the essential results of school research.

In many cases we find that a program works well for some students and not so much for others. In all cases the question is whether the program increases or decreases an existing gap. Researchers call this differential impact an interaction between the characteristics of the people in the study and the program, or they call it a moderated impact, as in: the program effect is moderated by the characteristic. If the goal is to narrow an achievement gap, the difference in impact between subgroups (for example, English language learners, kids in free lunch programs, or girls versus boys) provides the most useful information.
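For readers who want to see what estimating a moderated impact looks like in practice, here is a minimal sketch using an interaction term in a regression. The data file and column names (posttest, treated, ell, pretest, school_id) are hypothetical placeholders, not from any actual study.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical district administrative file with one row per student
    df = pd.read_csv("district_records.csv")

    # 'treated' = 1 if the student used the program; 'ell' = 1 for English learners;
    # 'pretest' is prior achievement. The treated:ell coefficient is the moderator
    # effect: how much the program's impact differs for English learners.
    model = smf.ols("posttest ~ treated * ell + pretest", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_id"]}  # cluster by school
    )
    print(model.summary())

In this setup, a positive interaction coefficient would suggest the program helps the subgroup more than its peers (narrowing the gap), while a negative one would suggest it widens the gap.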

Examining differential impacts across subgroups also turns out to be less subject to the kinds of bias that have concerned NCLB-era researchers. In a recent paper, Andrew Jaciw showed that with matched comparison designs, estimation of the differential effect of a program on contrasting subgroups of individuals can be less susceptible to bias than the researcher’s estimate of the average effect. Moderator effects are less prone to certain forms of selection bias. In this work, he develops the idea that when evaluating differential effects using matched comparison studies that involve cross-site comparisons, standard selection bias is “differenced away” or negated. While a different form of bias may be introduced, he shows empirically that it is considerably smaller. This is a compelling and surprising result and speaks to the importance of making moderator effects a much greater part of impact evaluations. Jaciw finds that the differential effects for subgroups do not necessarily inherit the biases that are found in average effects, which were the core focus of the NCLB era.

Therefore, matched comparison studies may be less biased than one might think for certain important quantities. On the other hand, RCTs, which are often believed to be without bias (thus, the “gold standard”), may be biased in ways that are often overlooked. For instance, results from RCTs may be limited by selection based on who chooses to participate. The teachers and schools who agree to be part of the RCT might bias results in favor of those more willing to take risks and try new things. In that case, the results wouldn’t generalize to less adventurous teachers and schools.

A general advantage that RCEs (in the form of matched comparison experiments) have over RCTs is that they can be conducted under more true-to-life circumstances. When existing data are used, outcomes reflect field implementations as they happened in real life. Such RCEs can be performed more quickly and at a lower cost than RCTs. They can be used by school districts that have paid for a pilot implementation of a product and want to know in June whether the program should be expanded in September. The key to this kind of quick-turnaround, rapid-cycle evaluation is to use data from the just-completed school year rather than following the NCLB-era habit of identifying schools that have never implemented the program and assigning teachers as users and non-users before implementation begins. Tools such as the RCE Coach (now the Evidence to Insights Coach) and Evidentally’s Evidence Suite are being developed to support district-driven as well as developer-driven matched comparisons.

Commercial publishers have also come under criticism for potential bias. The Hechinger Report recently published an article titled “Ed tech companies promise results, but their claims are often based on shoddy research.” Also from the Hechinger Report, Jill Barshay offered a critique titled “The dark side of education research: widespread bias.” She cites a working paper, recently published in a well-regarded journal, by a Johns Hopkins team led by Betsy Wolf, who worked with reports of studies within the WWC database (all using either RCTs or matched comparisons). Wolf compared results from studies paid for by the program developer with results from studies paid for through independent sources (such as IES grants to research organizations). Her study found that effect sizes substantially favored developer-sponsored studies. The most likely explanation was that developers are more prone to avoid publishing unfavorable results. This is called a “developer effect.” While we don’t doubt that Wolf found a real difference, the interpretation and importance of the bias can be questioned.

First, while more selective reporting by developers may bias their results upward, other biases may lead to smaller effects for independently funded research. Following NCLB-era rules, independently funded researchers must convince school districts to use the new materials or programs to be studied. But many developer-driven studies are conducted where the materials or program being studied is already in use (or is being considered for use and has been established as a good fit for adoption). The lack of inherent interest among districts recruited for independently funded studies might bias the overall average effect size downward.

When a developer is budgeting for an evaluation, working in a district that has already invested in the program and succeeded in its implementation is often the best way to provide the information that other districts need, since it shows not just the outcomes but a case study of an implementation. Results from a district that has chosen a program for a pilot and succeeded in its implementation may not constitute an unfair bias. While developer-selected districts with successful pilots may score higher than districts recruited by an independently funded researcher, they are also more likely to have commonalities with districts interested in adopting the program. Recruiting schools with no experience with the program may bias the results to be lower than they should be.

Second, the fact that bias was found in the standard NCLB-era average comparison between the user and non-user groups provides another reason to drop the primacy of the overall average and put our focus on the subgroup moderator analysis, where there may be less bias. The average outcome across all populations has little information value for school district decision-makers. Moderator effects are what they need if their goal is to reduce rather than enlarge an achievement gap.

We have no reason to assume that the information school decision-makers need has inherited the same biases that have been demonstrated in the use of developer-driven Tier 2 and 3 studies in evaluation of programs. We do see that the NCLB-era habit of ignoring subgroup differences reinforces the status quo and hides achievement gaps in K-12 schools.

In the next and final portion of this four-part blog series, we advocate replacing the NCLB-era single study with meta-analyses of many studies where the focus is on the moderators.

2020-06-26

ESSA’s Evidence Tiers and Potential for Bias

This is the second of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we explain how ESSA introduced flexibility and how NCLB-era habits have raised issues about bias. (Read the first one here.)

The tiers of evidence defined in the Every Student Succeeds Act (ESSA) give schools and researchers greater flexibility, but they are not without controversy. Flexibility creates the opportunity for biased results. The ESSA law, for example, states that studies must statistically control for “selection bias,” recognizing that teachers who “select” to use a program may have other characteristics that give them an advantage, so the results for those teachers could be biased upward. As we trace the problem of bias, it is useful to go back to the interpretation of ESSA that originated with the NCLB-era approach to research.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based educational products that automatically report usage data, it is important to clarify both ESSA’s useful advances and how the four tiers fail to address a critical scientific concept needed for schools to make use of research.

We’ve written elsewhere about how the ESSA tiers of evidence form a developmental scale. The four tiers give educators, as well as developers of educational materials and products, an easier way to start examining effectiveness without making the commitment to the type of scientifically-based research that NCLB once required.

We think of the four tiers of evidence defined in ESSA as a pyramid as shown in this figure.

ESSA levels of evidence pyramid

  1. RCT. At the apex is Tier 1, defined by ESSA as a randomized control trial (RCT), considered the gold standard in the NCLB era.
  2. Matched Comparison or “quasi-experiments”. With Tier 2, the WWC also allows for less rigorous experimental research designs, such as matched comparisons or quasi-experiments (QEs), where schools, teachers, and students (the experimental units) independently chose to engage in the program. QEs are permitted but accepted “with reservations” because, without random assignment, there is the possibility of “selection bias.” For example, teachers who do well at preparing kids for tests might be more likely to participate in a new program than teachers who don’t excel at test preparation. With an RCT we can expect that such positive traits are equally distributed between users and non-users in the experiment.
  3. Correlational. Tier 3 is an important and useful addition to evidence, as a weaker but readily achieved method once the developer has a product running in schools. At that point, they have an opportunity to see whether critical elements of the program correlate with outcomes of interest. This provides promising evidence, which is useful both for improving the product and for giving the schools some indication that it is helping. Such evidence suggests that it might be worthwhile to follow up with a Tier 2 study for more definitive results.
  4. Rationale. The base level or Tier 4 is the expectation that any product should have a rationale based on learning science for why it is likely to work. Schools will want this basic rationale for why a program should work before trying it out. Our colleagues at Digital Promise have announced a service in which developers are certified as meeting Tier 4 standards.

Each higher tier of evidence (moving from 4 to 1) improves what’s considered the “rigor” of the research design. It is important to understand that this hierarchy has nothing to do with whether the results can be generalized from the setting of the study to the district where the decision-maker resides.

While the NCLB-era focus on strong design puts the emphasis on the Tier 1 RCT, we see Tiers 2 and 3 as an opportunity for lower-cost and faster-turnaround “rapid-cycle evaluations” (RCEs). Tier 1 RCTs have given education research a well-deserved reputation as slow and expensive. It can take one to two years to complete an RCT, with additional time needed for data collection, analysis, and reporting. This extensive work also includes recruiting districts that are willing to participate in the RCT and often puts the cost of the study in the millions of dollars. We have conducted dozens of RCTs following the NCLB-era rules, but we advocate less expensive studies in order to get the volume of evidence schools need. In contrast to an RCT, an RCE that uses existing data from a school system can be both faster and far less expensive.

There is some controversy about whether schools should use lower-tier evidence, which might be subject to “selection bias.” Randomized control trials are protected from selection bias since users and non-users are assigned randomly, whether they like it or not. It is well known, and has recently been pointed out by Robert Slavin, that a matched comparison, a study in which teachers chose to participate in the pilot of a product, can leave unmeasured variables, technically “confounders,” that affect outcomes. These variables are associated with the qualities that motivate a teacher to pursue pilot studies and with their desire to excel in teaching. The comparison group may lack the characteristics that help the self-selected program users succeed. Tier 2 and 3 studies will always have, by definition, unmeasured variables that may act as confounders.

While this is obviously a concern, there are ways that researchers can statistically control for important characteristics associated with selection to use a program. For example, a teacher’s motivation to use edtech products can be controlled for by collecting prior-year information on how much the teacher and students used a full set of products. Past studies looking at the conditions under which the results of RCTs correspond with those of matched comparison studies evaluating the same program have established that it is exactly such “focal” variables, like motivation, that are the influential confounders. Controlling for a teacher’s demonstrated motivation and students’ incoming achievement may go very far in adjusting away bias. We suggest this in a design memo for a study now being undertaken. This statistical control meets the ESSA requirement for Tiers 2 and 3.
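As a concrete illustration of the kind of adjustment we have in mind (a sketch only, not the specification in the design memo), one approach is to estimate each teacher’s propensity to select the program from prior-year usage and incoming achievement, and then compare users with their closest matched non-users. The file and column names below are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    # Hypothetical teacher-level file with prior-year measures and outcomes
    df = pd.read_csv("teacher_records.csv")
    covariates = ["prior_year_usage_minutes", "incoming_achievement"]

    # Propensity to select the program, estimated from pre-implementation measures only
    ps_model = LogisticRegression().fit(df[covariates], df["used_program"])
    df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

    users = df[df["used_program"] == 1]
    non_users = df[df["used_program"] == 0]

    # Match each user to the non-user with the closest propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(non_users[["pscore"]])
    _, idx = nn.kneighbors(users[["pscore"]])
    matched = non_users.iloc[idx.ravel()]

    # Difference in outcomes after adjusting for the observed selection factors
    effect = users["outcome"].mean() - matched["outcome"].mean()
    print(f"Adjusted user vs. matched non-user difference: {effect:.3f}")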

We have a more radical proposal for controlling all kinds of bias that we address in the next posting in this series.

2020-06-22

Ending a Two-Decade-Old Research Legacy

This is the first of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. This post focuses on research methods established in the last 20 years and the reason that much of that research hasn’t been useful to schools. Subsequent posts will present an approach that works for schools.

With the COVID-19 crisis and school closures in full swing, use of edtech is predicted not just to continue but to expand when the classrooms it has replaced come back into use. Yet there is little information available about which edtech products work for whom and under what conditions, and that lack of evidence has been widely noted. As a research organization specializing in rigorous program evaluations, Empirical Education, working with Evidentally, Inc., has been developing methods for providing schools with useful and affordable information.

The “modern” era of education research was established with No Child Left Behind, the education law signed in 2002, and the creation of the Institute of Education Sciences (IES). NCLB’s declaration of “scientifically-based research” sparked a major undertaking assigned to the IES. To meet NCLB’s goals for improvement, it was essential to overthrow education research methods that lacked the practical goal of determining whether programs, products, practices, or policies moved the needle on an outcome of interest. These outcomes could be test scores or other measures, such as discipline referrals, that the schools were concerned with.

The kinds of questions that the learning sciences of the time answered were different: for example, how does performance on a test compare with performance on tasks outside the test? Or what is the structure of classroom dialogue? Instead of these qualitative methods, IES looked to the medical field and to other areas of policy, like workforce training, where randomization was used to assign subjects to a treatment group and a control group. Statistical summaries were then used to decide whether the researcher could reasonably conclude that a difference of the magnitude observed was unlikely to occur without a real effect.

In an attempt to mimic the medical field, IES set up the What Works Clearinghouse (WWC), which became the arbiter of acceptable research. WWC focused on getting the research design right. Many studies would be disqualified, with very few, and sometimes just one, meeting design requirements considered acceptable to provide valid causal evidence. This led to the idea that all programs, products, practices, or policies needed at least one good study that would prove its efficacy. The focus on design (or “internal validity”) was to the exclusion of generalizability (or “external validity”). Our team has conducted dozens of RCTs and appreciates the need for careful design. But we try to keep in mind the information that schools need.

Schools need to know whether the current version of the product will work when implemented using district resources and with their specific student and teacher populations. A single acceptable study may have been conducted a decade ago with a population whose ethnic makeup differs from that of the district now looking for evidence of what will work for it. The WWC has worked hard to be fair to each intervention, especially where there are multiple studies that meet its standards. But the key point is that each separate study comes with an average conclusion. While the WWC notes the demographic composition of the subjects in a study, differential results for subgroups, when a study tests for them, are considered secondary.

The Every Student Succeeds Act (ESSA) was passed with bipartisan support in late 2015 and replaced NCLB. ESSA replaced the scientifically-based research required by NCLB with four evidence tiers. While this was an important advance, ESSA retained the WWC as the source of the definitions for the top two tiers. The WWC, which remains the arbiter of evidence validity, gives the following explanation of the purpose of the evidence tiers.

“Evidence requirements under the Every Student Succeeds Act (ESSA) are designed to ensure that states, districts, and schools can identify programs, practices, products, and policies that work across various populations.”

The key to ESSA’s failings is in the final clause: “that work across various populations.” As a legacy of the NCLB-era, the WWC is only interested in the average impact across the various populations in the study. The problem is that district or school decision-makers need to know if the program will work in their schools given their specific population and resources.

The good news is that the education research community is recognizing that it is no longer necessary to ignore differences in population, region, product, and implementation. Strict rules were needed to make the paradigm shift to the RCT-oriented approach. But now that NCLB’s paradigm has been in place for almost two decades, and generations of educational researchers have been trained in it, we can broaden the outlook. Mark Schneider, IES’s current director, defines the IES mission as being “in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results. Researchers are looking at how to generalize results; for example, the Generalizer tool developed by Tipton and colleagues uses the demographics of a target district to generate an applicability metric. The Jefferson Education Exchange’s EdTech Genome Project has focused on implementation models as an important factor in efficacy.

Our methods move away from the legacy of the last 20 years to lower the cost of program evaluations, while retaining the scientific rigor and avoiding biases that give schools misleading information. Lowering cost makes it feasible for thousands of small local studies to be conducted on the multitude of school products. Instead of one or a small handful of studies for each product, we can use a dozen small studies encompassing the variety of contexts so that we can determine for whom and under what conditions the product works.

Read the second part of this series next.

2020-06-02

Feds Moving Toward a More Rational and Flexible Approach to Teacher Support and Evaluation

Congress is finally making progress on a bill to replace NCLB. Here’s an excerpt from a summary of the draft law.

TITLE II–
Helps states support teachers – The bill provides resources to states and school districts to implement various activities to support teachers, principals, and other educators, including allowable uses of funds for high quality induction programs for new teachers, ongoing professional development opportunities for teachers, and programs to recruit new educators to the profession.

Ends federal mandates on evaluations, allows states to innovate – The bill allows, but does not require, states to develop and implement teacher evaluation systems. This bill eliminates the definition of a highly qualified teacher—which has proven onerous to states and school districts—and provides states with the opportunity to define this term.

This is very positive. It makes teacher evaluation no longer an Obama-imposed requirement but allows states that want to do it (and there are quite a few of those) to use federal funds to support it. It removes the irrational requirement that “student growth” be a major component of these systems. This will lower the reflexive resistance from unions because the purpose of evaluation can be more clearly associated with teacher support (for more on that argument, see the Real Clear Education piece). It will also encourage the use of observation and feedback from administrators and mentors. Removing the outmoded definition of “highly qualified teacher” opens up the possibility of wider use of research-based analyses of what is important to measure in effective teaching.

A summary is also provided by EdWeek. On a separate note, it says: “That new research and innovation program that some folks were describing as sort of a next generation ‘Investing in Innovation’ program made it into the bill. (Sens. Orrin Hatch, R-Utah, and Michael Bennet, D-Colo., are big fans, as is the administration.)”

2015-11-24

Unintended Consequences of Using Student Test Scores to Evaluate Teachers

There has been a powerful misconception driving policy in education. It’s a case where theory was inappropriately applied to practice. The misconception has had unintended consequences. It is helping to lead large numbers of parents to opt out of testing and could very well weaken the case in Congress for accountability as ESEA is reauthorized.

The idea that we can use student test scores as one of the measures in evaluating teachers came into vogue with Race to the Top. As a result of that and related federal policies, 38 states now include measures of student growth in teacher evaluations.

This was a conceptual advance over the NCLB definition of teacher quality in terms of preparation and experience. The focus on test scores was also a brilliant political move. The simple qualification for funding from Race to the Top—a linkage between teacher and student data—moved state legislatures to adopt policies calling for more rigorous teacher evaluations even without funding states to implement the policies. The simplicity of pointing to student achievement as the benchmark for evaluating teachers seemed incontrovertible.

It also had a scientific pedigree. Solid work had been accomplished by economists developing value-added modeling (VAM) to estimate a teacher’s contribution to student achievement. Hanushek et al.’s analysis is often cited as the basis for the now widely accepted view that teachers make the single largest contribution to student growth. The Bill and Melinda Gates Foundation invested heavily in its Measures of Effective Teaching (MET) project, which put the econometric calculation of teachers’ contribution to student achievement at the center of multiple measures.

The academic debates around VAM remain intense concerning the most productive statistical specification and evidence for causal inferences. Perhaps the most exciting area of research is in analyses of longitudinal datasets showing that students who have teachers with high VAM scores continue to benefit even into adulthood and career—not so much in their test scores as in their higher earnings, lower likelihood of having children as teenagers, and other results. With so much solid scientific work going on, what is the problem with applying theory to practice? While work on VAMs has provided important findings and productive research techniques, there are four important problems in applying these scientifically-based techniques to teacher evaluation.

First, and this is the thing that should have been obvious from the start, most teachers teach in grades or subjects where no standardized tests are given. If you’re conducting research, there is a wealth of data for math and reading in grades three through eight. However, if you’re a middle-school principal and there are standardized tests for only 20% of your teachers, you will have a problem using test scores for evaluation.

Nevertheless, federal policy required states—in order to receive a waiver from some of the requirements of NCLB—to institute teacher evaluation systems that use student growth as a major factor. To fill the gap in test scores, a few districts purchased or developed tests for every subject taught. A more widespread practice is the use of Student Learning Objectives (SLOs). Unfortunately, while they may provide an excellent process for reflection and goal setting between the principal and teacher, they lack the psychometric properties of VAMs, which allow administrators to objectively rank a teacher in relation to other teachers in the district. As the Mathematica team observed, “SLOs are designed to vary not only by grade and subject but also across teachers within a grade and subject.” By contrast, academic research on VAM gave educators and policy makers the impression that a single measure of student growth could be used for teacher evaluation across grades and subjects. It was a misconception unfortunately promoted by many VAM researchers who may have been unaware that the technique could only be applied to a small portion of teachers.

There are several additional reasons that test scores are not useful for teacher evaluation.

The second reason is that VAMs or other measures of student growth don’t provide any indication as to how a teacher can improve. If the purpose of teacher evaluation is to inform personnel decisions such as terminations, salary increases, or bonuses, then, at least for reading and math teachers, VAM scores would be useful. But we are seeing a widespread orientation toward using evaluations to inform professional development. Other kinds of measures, most obviously classroom observations conducted by a mentor or administrator—combined with feedback and guidance—provide a more direct mapping to where the teacher needs to improve. The observer-teacher interactions within an established framework also provide appropriate managerial discretion in translating the evaluation into personnel decisions. Observation frameworks not only break the observation into specific aspects of practice but also provide a rubric for scoring at four or five defined levels. A teacher can view the training materials used to calibrate evaluators to see what the next level looks like. VAM scores are opaque by contrast.

Third, test scores are associated with a narrow range of classroom practice. My colleague, Val Lazarev, and I found an interesting result from a factor analysis of the data collected in the MET project. MET collected classroom videos from thousands of teachers, which were then coded using a number of frameworks. The students were tested in reading and/or math using an assessment more focused on problem-solving and constructed-response items than the usual state test. Our analysis showed that a teacher’s VAM score is more closely associated with the framework elements related to classroom and behavior management (i.e., keeping order in the classroom) than with the more refined aspects of dialogue with students. Keeping the classroom under control is a fundamental ability associated with good teaching but does not completely encompass what evaluators are looking for. Test scores, as the benchmark measure for effective teaching, may not be capturing many important elements.

Fourth, achievement test scores (and the VAMs based on them) capture only what teachers accomplish in improving test performance between the time students appear in their classes in the fall and when they take the standardized test in the spring. If you ask people about their most influential teacher, they talk about being inspired to take up a particular career or about being kept in school. These are results that are revealed in following years or even decades. A teacher who gets a student to start seeing math in a new way may not get immediate results on the spring test but may get the student to enroll in a more challenging course the next year. A teacher who makes a student feel at home in class may be an important part of that student not dropping out two years later. Whether or not teachers cause these results is speculative. But the characteristics of warm, engaging, and inspiring teaching can be observed. We now have analytic tools and longitudinal datasets that can begin to reveal the association between being in a teacher’s class and the probability of a student graduating, getting into college, and pursuing a productive career. With records of systematic classroom observations, we may be able, in the future, to associate teaching practices with benchmarks that are more meaningful than the spring test score.

The policy-makers’ dream of an algorithm for translating test scores into teacher salary levels is a fallacy. Even the weaker provisions, such as the vague requirement that student growth must be an important element among multiple measures in teacher evaluations, have led to a profusion of methods of questionable utility for setting individual goals for teachers. But the insistence on using annual student achievement as the benchmark has led to more serious, perhaps unintended, consequences.

Teacher unions have had good reason to object to using test scores for evaluations. Teacher opposition to this misuse of test scores has reinforced a negative perception of tests as something that teachers oppose in general. The introduction of the new Common Core tests might have been welcomed by the teaching profession as a stronger alignment of the test with the widely shared belief about what is important for students to learn. But the change was opposed by the profession largely because it would be unfair to evaluate teachers on the basis of a test they had no experience preparing students for. Reducing the teaching profession’s opposition to testing may help reduce the clamor of the opt-out movement and keep the schools on the path of continuous improvement of student assessment.

We can return to recognizing that testing has value for teachers as formative assessment. For the larger community, it has value as assurance that schools and districts are maintaining standards and, most importantly as the reauthorization of NCLB is considered, are not failing to educate the subgroups of students who have the most need.

A final note. For purposes of program and policy evaluation, for understanding the elements of effective teaching, and for longitudinal tracking of the effect on students of school experiences, standardized testing is essential. Research on value-added modeling must continue and expand beyond tests to measure the effect of teachers on preparing students for “college and career”. Removing individual teacher evaluation from the equation will be a positive step toward having the data needed for evidence-based decisions.

An abbreviated version of this blog post can be found on Real Clear Education.

2015-09-10

Need for Product Evaluations Continues to Grow

There is a growing need for evidence of the effectiveness of products and services being sold to schools. A new release of SIIA’s product evaluation guidelines is now available at the Selling to Schools website (with continued free access to SIIA members), to help guide publishers in measuring the effectiveness of the tools they are selling to schools.

It’s been almost a decade since NCLB made its call for “scientifically-based research,” but the calls for research haven’t faded away. This is because resources available to schools have diminished over that time, heightening the importance of cost benefit trade-offs in spending.

NCLB has focused attention on test score achievement, and this metric is becoming more pervasive; e.g., through a tie to teacher evaluation and through linkages to dropout risk. While NCLB fostered a compliance mentality—product specs had to have a check mark next to SBR—the need to assure that funds are not wasted is now leading to a greater interest in research results. Decision-makers are now very interested in whether specific products will be effective, or how well they have been working, in their districts.

Fortunately, the data available for evaluations of all kinds is getting better and easier to access. The U.S. Department of Education has poured hundreds of millions of dollars into state data systems. These investments make data available to states and drive the cleaning and standardizing of data from districts. At the same time, districts continue to invest in data systems and warehouses. While still not a trivial task, the ability of school district researchers to get the data needed to determine if an investment paid off—in terms of increased student achievement or attendance—has become much easier over the last decade.

The reauthorization of ESEA (i.e., NCLB) is maintaining the pressure to evaluate education products. We are still a long way from the draft reauthorization introduced in Congress becoming a law, but the initial indications are quite favorable to the continued production of product effectiveness evidence. The language has changed somewhat. Look for the phrase “evidence based”. Along with the term “scientifically-valid”, this new language is actually more sophisticated and potentially more effective than the old SBR neologism. Bob Slavin, one of the reviewers of the SIIA guidelines, says in his Ed Week blog that “This is not the squishy ‘based on scientifically-based evidence’ of NCLB. This is the real McCoy.” It is notable that the definition of “evidence-based” goes beyond just setting rules for the design of research, such as the SBR focus on the single dimension of “internal validity” for which randomization gets the top rating. It now asks how generalizable the research is or its “external validity”; i.e., does it have any relevance for decision-makers?

One of the important goals of the SIIA guidelines for product effectiveness research is to improve the credibility of publisher-sponsored research. It is important that educators see it as more than just “market research” producing biased results. In this era of reduced budgets, schools need to have tangible evidence of the value of products they buy. By following the SIIA’s guidelines, publishers will find it easier to achieve that credibility.

2011-11-12

Research: From NCLB to Obama’s Blueprint for ESEA

We can finally put “Scientifically Based Research” to rest. The term that appeared more than 100 times in NCLB appears zero times in the Obama administration’s Blueprint for Reform, which is the document outlining its approach to the reauthorization of ESEA. The term was always an awkward neologism, coined presumably to avoid simply saying “scientific research.” It also allowed NCLB to contain an explicit definition to be enforced—a definition stipulating not just any scientific activities, but research aimed at coming to causal conclusions about the effectiveness of some product, policy, or laboratory procedure.

A side effect of the SBR focus has been the growth of a compliance mentality among both school systems and publishers. Schools needed some assurance that a product was backed by SBR before they would spend money, while textbooks were ranked in terms of the number of SBR-proven elements they contained.

Some have wondered if the scarcity of the word “research” in the new Blueprint might signal a retreat from scientific rigor and the use of research in educational decisions (see, for example, Debra Viadero’s blog). Although the approach is indeed different, the new focus makes a stronger case for research and extends its scope into decisions at all levels.

The Blueprint shifts the focus to effectiveness. The terms “effective” or “effectiveness” appear about 95 times in the document. “Evidence” appears 18 times. And the compliance mentality is specifically called out as something to eliminate.

“We will ask policymakers and educators at all levels to carefully analyze the impact of their policies, practices, and systems on student outcomes. … And across programs, we will focus less on compliance and more on enabling effective local strategies to flourish.” (p. 35)

Instead of the stiff definition of SBR, we now have a call to “policymakers and educators at all levels to carefully analyze the impact of their policies, practices, and systems on student outcomes.” Thus we have a new definition for what’s expected: carefully analyzing impact. The call does not go out to researchers per se, but to policymakers and educators at all levels. This is not a directive from the federal government to comply with the conclusions of scientists funded to conduct SBR. Instead, scientific research is everybody’s business now.

Carefully analyzing the impact of practices on student outcomes is scientific research. For example, conducting research carefully requires making sure the right comparisons are made. A study that is biased by comparing two groups with very different motivations or resources is not a careful analysis of impact. A study that simply compares the averages of two groups without any statistical calculations can mistakenly identify a difference when there is none, or vice versa. A study that takes no measure of how schools or teachers used a new practice—or that uses tests of student outcomes that don’t measure what is important—can’t be considered a careful analysis of impact. Building the capacity to use adequate study design and statistical analysis will have to be on the agenda of the ESEA if the Blueprint is followed.
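To make “statistical calculations” concrete, here is a minimal sketch of the difference between simply comparing two group averages and asking whether the observed difference is larger than chance variation would produce. The scores are invented for illustration only.

    from scipy import stats

    # Invented outcome scores for a program group and a comparison group
    program_group = [72, 81, 68, 90, 77, 85, 74, 88]
    comparison_group = [70, 79, 66, 83, 75, 80, 71, 84]

    gap = (sum(program_group) / len(program_group)
           - sum(comparison_group) / len(comparison_group))
    t_stat, p_value = stats.ttest_ind(program_group, comparison_group)

    print(f"Raw difference in averages: {gap:.1f} points")
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    # A large p-value means a gap of this size could easily arise by chance alone.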

Far from reducing the role of research in the U.S. education system, the Blueprint for ESEA actually advocates a radical expansion. The word “research” is used only a few times, and “science” is used only in the context of STEM education. Nonetheless, the call for widespread careful analysis of the evidence of effective practices that impact student achievement broadens the scope of research, turning all policymakers and educators into practitioners of science.

2010-03-17

Development Grant Awarded to Empirical Education

The U.S. Department of Education awarded Empirical Education a research grant to develop web-based software tools to support school administrators in conducting their own program evaluations. The two-and-a-half-year project was awarded through the Small Business Innovation Research program administered and funded by the Institute of Education Sciences in the U.S. Department of Education. The proposal received excellent reviews in this competitive program. One reviewer remarked: “This software system is in the spirit of NCLB and IES to make curriculum, professional development, and other policy decisions based on rigorous research. This would be an improvement over other systems that districts and schools use that mostly generate tables.” While current data-driven decision-making systems provide tabular information or comparisons in terms of bar graphs, the software to be developed—an enhancement of our current MeasureResults™ program—helps school personnel create appropriate research designs following a decision process. It then provides access to a web-based service that uses sophisticated statistical software to test whether there is a difference in the results for a new program compared to the school’s existing programs. The reviewer added that the web-based system instantiates a “very good idea to provide [a] user-friendly and cost-effective software system to districts and schools to insert data for evaluating their own programs.” Another reviewer agreed, noting that: “The theory behind the tool is sound and would provide analyses appropriate to the questions being asked.” The reviewer also remarked that “…this would be a highly valuable tool. It is likely that the tool would be widely disseminated and utilized.” The company will begin deploying early versions of the software in school systems this coming fall.

2008-05-22