blog posts and news stories

Going Beyond the NCLB-Era to Reduce Achievement Gaps

We just published on Medium an important article that traces the recent history of education research to show how an unfortunate legacy of NCLB has weakened research methods, as applied to school use of edtech, and made invisible resulting achievement gaps. This article was originally a set of four blog posts by CEO Denis Newman and Chief Scientist Andrew Jaciw. The article shows how the legacy belief that differential subgroup effects (e.g., based on poverty, prior achievement, minority status, English proficiency) found in experiments are, at best, a secondary exploration that has left serious achievement gaps unexamined. And the false belief that only studies based on data collected before program implementation are free of misleading biases has given research the warranted reputation as very slow and costly. Instead, we present a rationale for low-cost and fast-turnaround studies using cloud-based edtech usage data combined with already collected school district administrative data. Working in districts that have already implemented the program lowers the cost to the point that a dozen small studies each examining subgroup effects, which Jaciw has shown to be relatively unbiased, can be combined to produce generalizable results. These results are what school decision-makers need in order to purchase edtech that works for all their students.

Read the article on medium here.

Or read the 4-part blog series we posted this past summer.

1. Ending a Two-Decade Research Legacy

2. ESSA Evidence Tiers and Potential for Bias

3. Validating Research that Helps Reduce Achievement Gaps

4. Putting Many Small Studies Together

2020-09-16

Putting Many Small Studies Together

This is the last of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision makers. Here we show how lots of small studies can give better evidence to resolve achievement gaps. To read the the first 3 parts, use these links.

1. Ending a Two-Decade Research Legacy

2. ESSA Evidence Tiers and Potential for Bias

3. Validating Research that Helps Reduce Achievement Gaps

The NCLB-era of the single big study should be giving way to the analysis of the differential impacts for subgroups from multiple studies. This is the information that schools need in order to reduce achievement gaps. Today’s technology landscape is ready for this major shift in the research paradigm. The school shutdowns resulting from the COVID-19 pandemic have demonstrated that the value of edtech products goes beyond just the cost reduction of eliminating expensive print materials. Over the last decade digital learning products have collected usage data which provides rich and systematic evidence of how products are being used and by whom. At the same time, schools have accumulated huge databases of digital records on demographics and achievement history, with public data at a granularity down to the grade-level. Using today’s “big data” analytics, this wealth of information can be put to work for a radical reduction in the cost of showing efficacy.

Fast turnaround, low cost research studies will enable hundreds of studies to be conducted providing information to school decision-makers that answer their questions. Their questions are not just “which program, on average, produces the largest effect?” Their questions are “which program is most likely to work in my district, with my kids and teachers, and with my available resources, and which are most likely to reduce gaps of greatest concern?”

Meta-analysis is a method for combining multiple studies to increase generalizability (Shadish, Cook, & Campbell, 2002). With meta-analysis, we can test for stability of effects across sites and synthesize those results, where warranted, based on specific statistical criteria. While moderator analysis is considered merely exploratory in the NCLB-era, using meta-analysis, moderator results from multiple small studies, can in combination provide confirmation of a differential impact. Meta-analysis, or other approaches to research synthesis, combined with big data present new opportunities to move beyond the NCLB-era philosophy that prizes the single big study to prove the efficacy of a program.

While addressing WWC and ESSA standards, we caution, that a single study in one school district, or even several studies in several school districts, may not provide enough useful information to generalize to other school districts. For research to be the most effective, we need studies in enough districts to represent the full diversity of relevant populations. Studies need to systematically include moderator analysis for an effective way to generalize impact for subgroups.

The definitions provided in ESSA do not address how much information is needed to generalize from a particular study for implementation in other school districts. While we accept that well-designed Tier 2 or 3 studies are necessary to establish an appropriate level of rigor, we do not believe a single study is sufficient to declare a program will be effective across varied populations. We note that the Standards for Excellence in Education Research (SEER) recently adopted by the IES, call for facilitating generalizability.

After almost two decades of exclusive focus on the design of the single study we need to more effectively address achievement gaps with the specifics that school decision-makers need. Lowering the cost and turn-around time for research studies that break out subgroup results is entirely feasible. With enough studies qualified for meta-analysis, a new wealth of information will be available to educators who want to select the products that will best serve their students. This new order will democratize learning across the country, reducing inequities and raising student achievement in K-12 schools.

2020-07-07

Validating Research that Helps Reduce Achievement Gaps

This is the third of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we show how issues of bias affecting the NCLB-era average impact estimates are not necessarily inherited by the differential subgroup estimates. (Read the first and the second parts of the story here.)

There’s a joke among researchers: “When a study is conducted in Brooklyn and in San Diego the average results should apply to Wichita.”

Following the NCLB-era rules, researchers usually put all their resources for a study into the primary result, providing a report on the average effectiveness of a product or tool across all populations. This leads to disregarding subgroup differences and misses the opportunity to discover that the program studied may work better or worse for certain populations. A particularly strong example is from our own work where this philosophy led to a misleading conclusion. While the program we studied was celebrated as working on average, it turned out not to help the Black kids. It widened an existing achievement gap. Our point is that differential impacts are not just extras, they should be the essential results of school research.

In many cases we find that a program works well for some students and not so much for others. In all cases the question is whether the program increases or decreases an existing gap. Researchers call this differential impact an interaction between the characteristics of the people in the study and the program, or they call it a moderated impact as in: the program effect is moderated by the characteristic. If the goal is to narrow an achievement gap, the difference between subgroups (for example: English language learners, kids in free lunch programs, or girls versus boys) in the impact provides the most useful information.

Examining differential impacts across subgroups also turns out to be less subject to the kinds of bias that have concerned NCLB-era researchers. In a recent paper, Andrew Jaciw showed that with matched comparison designs, estimation of the differential effect of a program on contrasting subgroups of individuals can be less susceptible to bias than the researcher’s estimate of the average effect. Moderator effects are less prone to certain forms of selection bias. In this work, he develops the idea that when evaluating differential effects using matched comparison studies that involve cross-site comparisons, standard selection bias is “differenced away” or negated. While a different form of bias may be introduced, he shows empirically that it is considerably smaller. This is a compelling and surprising result and speaks to the importance of making moderator effects a much greater part of impact evaluations. Jaciw finds that the differential effects for subgroups do not necessarily inherit the biases that are found in average effects, which were the core focus of the NCLB era.

Therefore, matched comparison studies, may be less biased than one might think for certain important quantities. On the other hand, RCTs, which are often believed to be without bias (thus, “gold standard”) may be biased in ways that are often overlooked. For instance, results from RCTs may be limited by selection based on who chooses to participate. The teachers and schools who agree to be part of the RCT might bias results in favor of those more willing to take risks and try new things. In that case, the results wouldn’t generalize to less adventurous teachers and schools.

A general advantage of RCEs (in the form of matched comparison experiments) have over RCTs is they can be conducted under more true-to-life circumstances. If using existing data, outcomes reflect results from field implementations as they happened in real life. Such RCEs can be performed more quickly and at a lower cost than RCTs. These can be used by school districts, which have paid for a pilot implementation of a product and want to know in June whether the program should be expanded in September. The key to this kind of quick turn-around, rapid-cycle evaluation is to use data from the just completed school year rather than following the NCLB-era habit of identifying schools that have never implemented the program and assigning teachers as users and non-users before implementation begins. Tools, such as the RCE Coach (now the Evidence to Insights Coach) and Evidentally’s Evidence Suite, are being developed to support district-driven as well as developer-driven matched comparisons.

Commercial publishers have also come under criticism for potential bias. The Hechinger Report recently published an article entitled: Ed tech companies promise results, but their claims are often based on shoddy research. Also from the Hechinger Report, Jill Barshay offered a critique entitled The dark side of education research: widespread bias. She cites a working paper that was recently published in a well-regarded journal by a Johns Hopkins team led by Betsy Wolf, who worked with reports of studies within the WWC database (all using either RCTs or matched comparisons). Wolf compared differences in results where some studies were paid for by the program developer and others paid for through independent sources (such as IES grants to research organizations). Wolf’s study found the size of the effect based on source of funding substantially favored developer sponsored studies. The most likely explanation was that developers are more prone to avoid publishing unfavorable results. This is called a “developer effect.” While we don’t doubt that Wolf found a real difference, the interpretation and importance of the bias can be questioned.

First while more selective reporting by developers may bias their reporting upward, other biases may lead to smaller effects for independently-funded research. Following NCLB-era rules, independently-funded researchers must convince school districts to use the new materials or programs to be studied. But many developer-driven studies are conducted where the materials or program being studied is already in use (or being considered for use and established as a good fit for adoption). The bias to the average overall effect size from a lack of inherent interest might result in a lower effect estimate.

When a developer is budgeting for an evaluation, working in a district that has already invested in the program and succeeded in its implementation is often the best way to provide the information that other districts need since it shows not just the outcomes but a case study of an implementation. Results from a district that has chosen a program for a pilot and succeeded in its implementation may not be an unfair bias. While the developer selected districts with successful pilots may score higher than districts recruited by an independently funded researcher, they are also more likely to have commonalities with districts interested in adopting the program. Recruiting schools with no experience with the program may bias the results to be lower than they should be.

Second, the fact that bias was found in the standard NCLB-era average between the user and non-user groups provides another reason to drop the primacy of the overall average and put our focus on the subgroup moderator analysis where there may be less bias. Average outcomes across all populations has little information value for school district decision-makers. Moderator effects are what they need if their goal is to reduce rather than enlarge an achievement gap.

We have no reason to assume that the information school decision-makers need has inherited the same biases that have been demonstrated in the use of developer-driven Tier 2 and 3 studies in evaluation of programs. We do see that the NCLB-era habit of ignoring subgroup differences reinforces the status quo and hides achievement gaps in K-12 schools.

Next, we advocate replacing the NCLB-era single study with meta-analyses of many studies where the focus is on the moderators.

2020-06-26

ESSA’s Evidence Tiers and Potential for Bias

This is the second of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. Here we explain how ESSA introduced flexibility and how NCLB-era habits have raised issues about bias. (Read the first one here.)

The tiers of evidence defined in the Every Student Succeeds Act (ESSA) give schools and researchers greater flexibility, but are not without controversy. Flexibility creates the opportunity for biased results. The ESSA law, for example, states that studies must statistically control for “selection bias”, recognizing that teachers who “select” to use a program may have other characteristics that give them an advantage and the results for those teachers could be biased upward. As we trace the problem of bias it is useful to go back to the interpretation of ESSA that originated with the NCLB-era approach to research.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based educational products that automatically report usage data, it is important to clarify both ESSA’s useful advances and how the four tiers fail to address a critical scientific concept needed for schools to make use of research.

We’ve written elsewhere how the ESSA tiers of evidence form a developmental scale. The four tiers give educators as well as developers of educational materials and products an easier way to start examining effectiveness without making the commitment to the type of scientifically-based research that NCLB once required.

We think of the four tiers of evidence defined in ESSA as a pyramid as shown in this figure.

ESSA levels of evidence pyramid

  1. RCT. At the apex is Tier 1, defined by ESSA as a randomized control trial (RCT), considered the gold standard in the NCLB era.
  2. Matched Comparison or “quasi-experiments”. With Tier 2 the WWC also allowed for less rigorous experimental research design, such as matched comparisons or quasi-experiments (QE) where schools, teachers, and students (experimental units) independently chose to engage in the program. QEs are permitted but accepted “with reservations” because without random assignment there is the possibility of “selection bias.” For example, teachers who do well at preparing kids for tests might be more likely to participate in a new program than teachers who don’t excel at test preparation. With an RCT we can expect that such positive traits are equally distributed in the experiment between users and non-users.
  3. Correlational. Tier 3 is an important and useful addition to evidence, as a weaker but readily achieved method once the developer has a product running in schools. At that point, they have an opportunity to see if critical elements of the program correlate with outcomes of interest. This provides promising evidence, which is useful for both improving the product and giving the schools some indication that it is helping. This evidence suggests that it might be worthwhile to follow up with a tier 2 study for more definitive results.
  4. Rationale. The base level or Tier 4 is the expectation that any product should have a rationale based on learning science for why it is likely to work. Schools will want this basic rationale for why a program should work before trying it out. Our colleagues at Digital Promise have announced a service in which developers are certified as meeting Tier 4 standards.

Each subsequent tier of evidence (from number 4 to 1) improves what’s considered the “rigor” of the research design. It is important to understand that the hierarchy has nothing to do with whether the results can be generalized from the setting of the study to the district where the decision-maker resides.

While the NCLB-era focus on strong design puts emphasis on the Tier 1 RCT, we see Tiers 2 and 3 as an opportunity for lower cost and faster-turn-around “rapid-cycle evaluations” (RCE.) Tier 1 RCTs have given education research a well-deserved reputation as slow and expensive. It can take one to two years to complete an RCT, with additional time needed for data collection, analysis, and reporting. This extensive work also includes recruiting districts that are willing to participate in the RCT and often puts the cost of the study in the millions of dollars. We have conducted dozens of RCTs following the NCLB-era rules, but advocate less expensive studies in order to get the volume of evidence schools need. In contrast to an RCT, an RCE can use existing data from a school system can be both faster and far less expensive.

There is some controversy about whether schools should use lower-tier evidence, which might be subject to “selection bias.” Randomized control trials are protected from selection bias since users and non-user are assigned randomly, whether they like it or not. It is well known and has been recently pointed out by Robert Slavin that using a matched comparison, a study where teachers chose to participate in the pilot of a product, can result in unmeasured variables, technically “confounders” that affect outcomes. These variables are associated with the qualities that motivate a teacher to pursue pilot studies and their desire to excel in teaching. The comparison group may lack these characteristics that help the self-selected program users succeed. Studies of Tiers 2 and 3 will always have, by definition, unmeasured variables that may act as confounders.

While obviously a concern, there are ways that researchers can statistically control important characteristics associated with selection to use a program. For example, the amount of a teacher’s motivation to use edtech products can be controlled by collecting information from the prior year on the amount of usage by the teacher and students of a full set of products. Past studies looking at the conditions under which there is correspondence between results of RCTs and matched comparison studies that evaluate the impact of a given program have established that it is exactly “focal” variables such as motivation, that are influential confounders. Controlling for a teacher’s demonstrated motivation and students’ incoming achievement may go very far in adjusting away bias. We suggest this in a design memo for a study now being undertaken. This statistical control meets the ESSA requirement for Tiers 2 and 3.

We have a more radical proposal for controlling all kinds of bias that we address in the next posting in this series.

2020-06-22

Agile Assessment and the Impact of Formative Testing on Student Achievement in Algebra

Empirical Education contracted with Jefferson Education Accelerator (JEA) to conduct a study on the effectiveness of formative testing for improving student achievement in Algebra. We partnered with a large urban school district in the northeast U.S. to evaluate their use of Agile Assessment. Developed by experts at the Charles A Dana Center at the University of Texas and education company Agile Mind, Agile Assessment is a flexible system for developing, administering, and analyzing student assessments that are aligned by standard, reading level, and level of difficulty. The district used benchmark Agile Assessments in the fall, winter, and spring to assess student performance in Algebra along with a curriculum it had chosen independent of assessments.

We conducted a quasi-experimental comparison group study using data from the 2016-17 school year and examined the impact of Agile Assessment usage on student achievement for roughly 1,000 students using the state standardized assessment in Algebra.

There were three main findings from the study:

  1. Algebra scores for students who used Agile Assessment were better than scores of comparison students. The result had an effect size of .30 (p = .01), which corresponds to a 12-percentile point gain, adjusting for differences in student demographics and pretest between treatment and comparison students.
  2. The positive impact of Agile Assessment generalized across many student subgroups, including Hispanic students, economically disadvantaged students and special education students.
  3. Outcomes on the state Algebra assessment were positively associated with the average score on the Agile Assessment benchmark tests. That said, adding the average score on Agile Assessment benchmark tests to the linear model increased its predictive power by a small amount.

These findings provide valuable evidence in favor of formative testing for the district and other stakeholders. Given disruptions in the current public school paradigm, increased frequency of formative assessment could provide visibility towards greater personalized instruction and ultimately increase student outcomes. You can read the full research report here.

2020-06-17

Ending a Two-Decade-Old Research Legacy

This is the first of a four-part blog posting about changes needed to the legacy of NCLB to make research more useful to school decision-makers. This post focuses on research methods established in the last 20 years and the reason that much of that research hasn’t been useful to schools. Subsequent posts will present an approach that works for schools.

With the COVID-19 crisis and school closures in full swing, use of edtech is predicted to not just continue, but expand when the classrooms they have replaced come back in use. There is little information available about which edtech products work for whom and under what conditions and the lack of evidence is noted. As a research organization specializing in rigorous program evaluations, Empirical Education, working with Evidentally, Inc., has been developing methods for providing schools with useful and affordable information.

The “modern” era of education research was established with No Child Left Behind, the education law passed in 2002, which established the Institute of Education Sciences (IES). The declaration of “scientifically-based research” in NCLB sparked a major undertaking assigned to the IES. To meet NCLB’s goals for improvement, it was essential to overthrow education research methods that lacked the practical goal of determining whether programs, products, practices, or policies moved the needle for an outcome of interest. These outcomes could be test scores or other measures, such as discipline referrals, that the schools were concerned with.

The kinds of questions that the learning sciences of the time answered were different: for example, how can we compare tests with tasks outside of the test? Or what is the structure of the classroom dialogue? Instead of these qualitative methods, IES looked to the medical field and other areas of policy like workforce training where randomization was used to assign subjects to a treatment group and a control group. Statistical summaries were then used to decide whether the researcher can reasonably conclude that a difference of the magnitude observed is big enough that it is unlikely to happen without a real effect.

In an attempt to mimic the medical field, IES set up the What Works Clearinghouse (WWC), which became the arbiter of acceptable research. WWC focused on getting the research design right. Many studies would be disqualified, with very few, and sometimes just one, meeting design requirements considered acceptable to provide valid causal evidence. This led to the idea that all programs, products, practices, or policies needed at least one good study that would prove its efficacy. The focus on design (or “internal validity”) was to the exclusion of generalizability (or “external validity”). Our team has conducted dozens of RCTs and appreciates the need for careful design. But we try to keep in mind the information that schools need.

Schools need to know whether the current version of the product will work, when implemented using district resources and with its specific student and teacher populations. A single acceptable study may have been conducted a decade ago with ethnicities different from the district that is looking for evidence of what will work for them. The WWC has worked hard to be fair to each intervention, especially where there are multiple studies that meet their standards. But, the key is that each separate study comes with an average conclusion. While the WWC notes the demographic composition of the subjects in the study, difference results for subgroups when tested by the study are considered secondary.

The Every Student Succeeds Act (ESSA) was passed with bi-partisan support in late 2015 and replaced NCLB. ESSA replaced scientifically-based research required by NCLB with four evidence tiers. While this was an important advance, ESSA retained the WWC as providing the definition of the top two tiers. The WWC, which remains the arbiter of evidence validity, gives the following explanation of the purpose of the evidence tiers.

“Evidence requirements under the Every Student Succeeds Act (ESSA) are designed to ensure that states, districts, and schools can identify programs, practices, products, and policies that work across various populations.”

The key to ESSA’s failings is in the final clause: “that work across various populations.” As a legacy of the NCLB-era, the WWC is only interested in the average impact across the various populations in the study. The problem is that district or school decision-makers need to know if the program will work in their schools given their specific population and resources.

The good news is that the education research community is recognizing that ignoring the population, region, product, and implementation differences is no longer necessary. Strict rules were needed to make the paradigm shift to the RCT-oriented approach. But now that NCLB’s paradigm has been in place for almost two decades, and generations of educational researchers have been trained in it, we can broaden the outlook. Mark Schneider, IES’s current director, defines the IES mission as being “in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results. Researchers are looking at how to generalize results; for example, the Generalizer tool developed by Tipton and colleagues uses demographics of a target district to generate an applicability metric. The Jefferson Education Exchange’s EdTech Genome Project has focused on implementation models as an important factor in efficacy.

Our methods move away from the legacy of the last 20 years to lower the cost of program evaluations, while retaining the scientific rigor and avoiding biases that give schools misleading information. Lowering cost makes it feasible for thousands of small local studies to be conducted on the multitude of school products. Instead of one or a small handful of studies for each product, we can use a dozen small studies encompassing the variety of contexts so that we can determine for whom and under what conditions the product works.

Read the second part of this series next.

2020-06-02

Updating Evidence According to ESSA

The U.S. Department of Education (ED) sets validity standards for evidence of what works in schools through The Every Student Succeeds Act (ESSA), which provides usefully defined tiers of evidence.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based online tools that automatically report usage data, it is important to review the standards and to clarify both ESSA’s useful advances and how the four tiers fail to address some critical scientific concepts. These concepts are needed for states, districts, and schools to make the best use of research findings.

We will cover the following in subsequent postings on this page.

  1. Evidence According to ESSA: Since the founding of the Institute of Education Sciences and NCLB in 2002, the philosophy of evaluation has been focused on the perfection of one good study. We’ll discuss the cost and technical issues this kind of study raises and how it sometimes reinforces educational inequity.
  2. Who is Served by Measuring Average Impact: The perfect design focused on the average impact of a program across all populations of students, teachers, and schools has value. Yet, school decision makers need to also know about the performance differences between specific groups such as students who are poor or middle class, teachers with one kind of preparation or another, or schools with AP courses vs. those without. Mark Schneider, IES’s director defines the IES mission as “IES is in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results.
  3. Differential Impact is Unbiased. According to the ESSA standards, studies must statistically control for selection bias and other sources of bias. But biases that impact the average for the population in the study don’t impact the size of the differential effect between subgroups. The interaction between the program and the population characteristic is unaffected. And that’s what the educators need to know about.
  4. Putting many small studies together. Instead of the One Good Study approach we see the need for multiple studies each collecting data on differential impacts for subgroups. As Andrew Coulson of Mind Research Institute put it, we have to move from the One Good Study approach and on to valuing multiple studies with enough variety to be able to account for commonalities among districts. We add that meta-analysis of interaction effects are entirely feasible.

Our team works closely with the ESSA definitions and has addressed many of these issues. Our design for an RCE for the Dynamic Learning Project shows how Tiers 2 and 3 can be combined to answer questions involving intermediate results (or mediators). If you are interested in more information on the ESSA levels of evidence, the video on this page is a recorded webinar that provides clarification.

2020-05-12

Empirical Describes Innovative Approach to Research Design for Experiment on the Value of Instructional Technology Coaching

Empirical Education (Empirical) is collaborating with Digital Promise to evaluate the impact of the Dynamic Learning Project (DLP) on student achievement. The DLP provides school-based instructional technology coaches to participating districts to increase educational equity and impactful use of technology. Empirical is working with data from prior school years, allowing us to continue this work during this extraordinary time of school closures. We are conducting quasi-experiments in three school districts across the U.S. designed to provide evidence that will be useful to DLP stakeholders, including schools and districts considering using the DLP coaching model. Today, Empirical has released its design memo outlining its innovative approach to combining teacher-level and student-level outcomes through experimental and correlational methods.

Digital Promise— through funding and partnership with Google—launched the DLP in 2017 with more than 1,000 teachers in 50 schools across 18 districts in five states. The DLP expanded in the second year of implementation (2018-2019) with more than 100 schools reached across 23 districts in seven states. Digital Promise’s surveys of participating teachers have documented teachers’ belief in the DLP’s ability to improve instruction and increase impactful technology use (see Digital Promise’s extensive postings on the DLP). Our rapid cycle evaluations will work with data from the same cohorts, while adding district administrative data and data on technology use.

Empirical’s studies will establish valuable links between instructional coaching, technology use, and student achievement, all while helping to improve future iterations of the DLP coaching model. As described in our design memo, the study is guided by Digital Promise’s logic model. In this model, coaching is expected to affect an intermediate outcome, measured in Empirical’s research in terms of patterns of usage of edtech applications, as they implicate instructional practices. These patterns and practices are then expected to impact measurable student outcomes. The Empirical team will evaluate the impact of coaching on both the mediator (patterns and practices) and the student test outcomes. We will examine student-level outcomes by subgroup. The data are currently in the collection process, and we expect report publication by summer 2020.

2020-05-01

Updated Research on the Impact of Alabama’s Math, Science, and Technology Initiative (AMSTI) on Student Achievement

We are excited to release the findings of a new round of work conducted to continue our investigation of AMSTI. Alabama’s specialized training program for math and science teachers began over 20 years ago and now reaches over 900 schools across the state. As the program is constantly evolving to meet the demands of new standards and new assessment systems, the AMSTI team and the Alabama State Department of Education continue to support research to evaluate the program’s impact. Our new report builds on the work undertaken last year to answer three new research questions.

  1. What is the impact of AMSTI on reading achievement? We found a positive impact of AMSTI for students on the ACT Aspire reading assessment equivalent to 2 percentile points. This replicates a finding from our earlier 2012 study. This analysis used students of AMSTI-trained science teachers, as the training purposely integrates reading and writing practices into the science modules.
  2. What is the impact of AMSTI on early-career teachers? We found positive impacts of AMSTI for partially-trained math teachers and fully-trained science teachers. The sample of teachers for this analysis was those in their first three years of teaching, with varying levels of AMSTI training.
  3. How can AMSTI continue program development to better serve ELL students? Our earlier work found a negative impact of AMSTI training for ELL students in science. Building upon these results, we were able to identify a small subset of “model ELL AMSTI schools” where there was both a positive impact of AMSTI on ELL students, and where that impact was larger than any school-level effect on ELL students versus the entire sample. By looking at the site-specific best practices of these schools for supporting ELL students in science and across the board, the AMSTI team can start to incorporate these strategies into the program at large.

All research Empirical Education has conducted on AMSTI can be found on our AMSTI webpage.

2020-04-06

Report Released on the Effectiveness of SRI/CAST's Enhanced Units

Empirical Education has released the results of a semester-long randomized experiment on the effectiveness of SRI/CAST’s Enhanced Units (EU). This study was conducted in cooperation with one district in California, and with two districts in Virginia, and was funded through a competitive Investing in Innovation (i3) grant from the U.S. Department of Education. EU combines research-based content enhancement routines, collaboration strategies and technology components for secondary history and biology classes. The goal of the grant is to improve student content learning and higher order reasoning, especially for students with disabilities. EU was developed during a two-year design-based implementation process with teachers and administrators co-designing the units with developers.

The evaluation employed a group randomized control trial in which classes were randomly assigned within teachers to receive the EU curriculum, or continue with business-as-usual. All teachers were trained in Enhanced Units. Overall, the study involved three districts, five schools, 13 teachers, 14 randomized blocks, and 30 classes (15 in each condition, with 18 in biology and 12 in U.S. History). This was an intent-to-treat design, with impact estimates generated by comparing average student outcomes for classes randomly assigned to the EU group with average student outcomes for classes assigned to control group status, regardless of the level of participation in or teacher implementation of EU instructional approaches after random assignment.

Overall, we found a positive impact of EU on student learning in history, but not on biology or across the two domains combined. Within biology, we found that students experienced greater impact on the Evolution unit than the Ecology unit. These findings supports a theory developed by the program developers that EU works especially well with content that progresses in a sequential and linear way. We also found a positive differential effect favoring students with disabilities, which is an encouraging result given the goal of the grant.

The full report for this study can be downloaded using the link below.

Enhanced Units final report

2019-12-26
Archive