
Updating Evidence According to ESSA

The U.S. Department of Education (ED) sets validity standards for evidence of what works in schools through the Every Student Succeeds Act (ESSA), which provides usefully defined tiers of evidence.

When we helped develop the research guidelines for the Software & Information Industry Association, we took a close look at ESSA and how it is often interpreted. Now, as research is evolving with cloud-based online tools that automatically report usage data, it is important to review the standards and to clarify both ESSA’s useful advances and how the four tiers fail to address some critical scientific concepts. These concepts are needed for states, districts, and schools to make the best use of research findings.

We will cover the following in subsequent postings on this page.

  1. Evidence According to ESSA: Since the founding of the Institute of Education Sciences and NCLB in 2002, the philosophy of evaluation has been focused on the perfection of one good study. We’ll discuss the cost and technical issues this kind of study raises and how it sometimes reinforces educational inequity.
  2. Who is Served by Measuring Average Impact: The perfect design focused on the average impact of a program across all populations of students, teachers, and schools has value. Yet, school decision makers also need to know about the performance differences between specific groups such as students who are poor or middle class, teachers with one kind of preparation or another, or schools with AP courses vs. those without. Mark Schneider, IES’s director, defines the IES mission this way: “IES is in the business of identifying what works for whom under what conditions.” This framing is a move toward a broader focus with more relevant results.
  3. Differential Impact is Unbiased: According to the ESSA standards, studies must statistically control for selection bias and other sources of bias. But biases that affect the average for the population in the study don’t affect the size of the differential effect between subgroups. The interaction between the program and the population characteristic is unaffected, and that is what educators need to know about.
  4. Putting Many Small Studies Together: Instead of the One Good Study approach, we see the need for multiple studies, each collecting data on differential impacts for subgroups. As Andrew Coulson of Mind Research Institute put it, we have to move from the One Good Study approach to valuing multiple studies with enough variety to account for commonalities among districts. We add that meta-analysis of interaction effects is entirely feasible.
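To make the last point concrete, pooling differential-impact estimates across studies can use standard meta-analytic machinery. Below is a minimal Python sketch of fixed-effect (inverse-variance weighted) pooling; the three (estimate, standard error) pairs are entirely hypothetical, standing in for three small studies of the same program.

```python
import math

# Hypothetical interaction (differential-impact) estimates from three
# small studies, each given as (estimate, standard error).
studies = [(2.0, 1.0), (1.0, 0.5), (3.0, 1.5)]

def pool_fixed_effect(estimates):
    """Inverse-variance weighted average of study-level estimates."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for (est, _), w in zip(estimates, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

pooled, pooled_se = pool_fixed_effect(studies)
# The pooled estimate is more precise (smaller SE) than any single study.
```

The point of the sketch is simply that interaction estimates pool the same way average impacts do; no single small study needs to carry the burden alone.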

Our team works closely with the ESSA definitions and has addressed many of these issues. Our design for a rapid cycle evaluation (RCE) for the Dynamic Learning Project shows how Tiers 2 and 3 can be combined to answer questions involving intermediate results (or mediators). If you are interested in more information on the ESSA levels of evidence, the video on this page is a recorded webinar that provides clarification.

2020-05-12

Empirical Describes Innovative Approach to Research Design for Experiment on the Value of Instructional Technology Coaching

Empirical Education (Empirical) is collaborating with Digital Promise to evaluate the impact of the Dynamic Learning Project (DLP) on student achievement. The DLP provides school-based instructional technology coaches to participating districts to increase educational equity and impactful use of technology. Empirical is working with data from prior school years, allowing us to continue this work during this extraordinary time of school closures. We are conducting quasi-experiments in three school districts across the U.S. designed to provide evidence that will be useful to DLP stakeholders, including schools and districts considering using the DLP coaching model. Today, Empirical has released its design memo outlining its innovative approach to combining teacher-level and student-level outcomes through experimental and correlational methods.

Digital Promise, through funding and partnership with Google, launched the DLP in 2017 with more than 1,000 teachers in 50 schools across 18 districts in five states. The DLP expanded in the second year of implementation (2018-2019), reaching more than 100 schools across 23 districts in seven states. Digital Promise’s surveys of participating teachers have documented teachers’ belief in the DLP’s ability to improve instruction and increase impactful technology use (see Digital Promise’s extensive postings on the DLP). Our rapid cycle evaluations will work with data from the same cohorts, while adding district administrative data and data on technology use.

Empirical’s studies will establish valuable links between instructional coaching, technology use, and student achievement, all while helping to improve future iterations of the DLP coaching model. As described in our design memo, the study is guided by Digital Promise’s logic model. In this model, coaching is expected to affect an intermediate outcome, measured in Empirical’s research in terms of patterns of usage of edtech applications, insofar as they reflect instructional practices. These patterns and practices are then expected to impact measurable student outcomes. The Empirical team will evaluate the impact of coaching on both the mediator (patterns and practices) and the student test outcomes. We will examine student-level outcomes by subgroup. Data collection is currently underway, and we expect to publish the report by summer 2020.

2020-05-01

Updated Research on the Impact of Alabama’s Math, Science, and Technology Initiative (AMSTI) on Student Achievement

We are excited to release the findings of a new round of work conducted to continue our investigation of AMSTI. Alabama’s specialized training program for math and science teachers began over 20 years ago and now reaches over 900 schools across the state. As the program is constantly evolving to meet the demands of new standards and new assessment systems, the AMSTI team and the Alabama State Department of Education continue to support research to evaluate the program’s impact. Our new report builds on the work undertaken last year to answer three new research questions.

  1. What is the impact of AMSTI on reading achievement? We found a positive impact of AMSTI for students on the ACT Aspire reading assessment equivalent to 2 percentile points. This replicates a finding from our earlier 2012 study. This analysis used students of AMSTI-trained science teachers, as the training purposely integrates reading and writing practices into the science modules.
  2. What is the impact of AMSTI on early-career teachers? We found positive impacts of AMSTI for partially-trained math teachers and fully-trained science teachers. The sample of teachers for this analysis was those in their first three years of teaching, with varying levels of AMSTI training.
  3. How can AMSTI continue program development to better serve ELL students? Our earlier work found a negative impact of AMSTI training for ELL students in science. Building on these results, we identified a small subset of “model ELL AMSTI schools” where the impact of AMSTI on ELL students was positive and larger than the school-level effect on ELL students relative to the entire sample. By looking at the site-specific best practices these schools use to support ELL students in science and across the board, the AMSTI team can start to incorporate these strategies into the program at large.

All research Empirical Education has conducted on AMSTI can be found on our AMSTI webpage.

2020-04-06

Presenting at AERA 2018

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in New York City from April 13-17, 2018.

Research presentations will include the following.

For Quasi-Experiments on EdTech Products, What Counts as Being Treated?
Authors: Val Lazarev, Denis Newman, & Malvika Bhagwat
In Roundtable Session: Examining the Impact of Accountability Systems on Both Teachers and Students
Friday, April 13 - 2:15 to 3:45pm
New York Marriott Marquis, Fifth Floor, Westside Ballroom Salon 3

Abstract: Edtech products are becoming increasingly prevalent in K-12 schools, and the need for schools to evaluate their value for students calls for a program of rigorous research, at least at Level 2 of the ESSA standards for evidence. This paper draws on our experience conducting a large-scale quasi-experiment in California schools. The product’s wide-ranging intensity of implementation presented a challenge in identifying schools that had used the product enough to be considered part of the treatment group.


Planning Impact Evaluations Over the Long Term: The Art of Anticipating and Adapting
Authors: Andrew P Jaciw & Thanh Thi Nguyen
In Session: The Challenges and Successes of Conducting Large-Scale Educational Research
Saturday, April 14 - 2:15 to 3:45pm
Sheraton New York Times Square, Second Floor, Central Park East Room

Abstract: Perspective. It is good practice to identify core research questions and important elements of study designs a priori, to prevent post-hoc “fishing” exercises and reduce the risk of drawing false-positive conclusions [16,19]. However, programs in education, and evaluations of them, evolve [6], making it difficult to follow a charted course. For example, in the lifetime of a program and its evaluation, new curricular content or evidence standards for evaluations may be introduced and thus drive changes in program implementation and evaluation.

Objectives. This work presents three cases from program impact evaluations conducted through the Department of Education. In each case, unanticipated results or changes in study context had significant consequences for program recipients, developers and evaluators. We discuss responses, either enacted or envisioned, for addressing these challenges. The work is intended to serve as a practical guide for researchers and evaluators who encounter similar issues.

Methods/Data Sources/Results. The first case concerns the problem of outcome measures keeping pace with evolving content standards. For example, in assessing impacts of science programs, program developers and evaluators are challenged to find assessments that align with the Next Generation Science Standards (NGSS). Existing NGSS-aligned assessments are largely untested or in development, leaving the evaluator to find, adapt, or develop instruments with strong reliability and construct and face validity – ones that will be accepted by independent review and not considered over-aligned to the interventions. We describe a hands-on approach to working with a state testing agency to develop forms to assess impacts on science generally, and on constructs more specifically aligned to the program evaluated.

The second case concerns the problem of reprioritizing research questions mid-study. As noted above, researchers often identify primary (confirmatory) research questions at the outset of a study. Such questions are held to high evidence standards and are differentiated from exploratory questions, which often originate after examining the data and must be replicated to be considered reliable [16]. However, exploratory analyses sometimes produce unanticipated results that may be highly consequential. The evaluator must grapple with the dilemma of whether to re-prioritize the result or attempt to proceed with replication. We discuss this issue with reference to an RCT in which the dilemma arose.

The third case concerns the problem of designing and implementing a study that meets one set of evidence standards when the results will be reviewed according to a later version of those standards. A practical question is what to do when this happens and the study consequently falls under a lower tier of the new evidence standard. With reference to an actual case, we consider several response options, including assessing the consequence of this reclassification for future funding of the program, and augmenting the research design to satisfy the new standards of evidence.

Significance. Responding to demands of changing contexts, programs in the social sciences are moving targets. They demand a flexible but well-reasoned and justified approach to evaluation. This session provides practical examples and is intended to promote discussion for generating solutions to challenges of this kind.


Indicators of Successful Teacher Recruitment and Retention in Oklahoma Rural Schools
Authors: Val Lazarev, Megan Toby, Jenna Lynn Zacamy, Denis Newman, & Li Lin
In Session: Teacher Effectiveness, Retention, and Coaching
Saturday, April 14 - 4:05 to 6:05pm
New York Marriott Marquis, Fifth Floor, Booth

Abstract: The purpose of this study was to identify factors associated with successful recruitment and retention of teachers in Oklahoma rural school districts, in order to highlight potential strategies to address Oklahoma’s teaching shortage. The study was designed to identify teacher-level, district-level, and community characteristics that predict which teachers are most likely to be successfully recruited and retained. A key finding is that for teachers in rural schools, total compensation and increased responsibilities in job assignment are positively associated with successful recruitment and retention. Evidence provided by this study can be used to inform incentive schemes to help retain certain groups of teachers and increase retention rates overall.


Teacher Evaluation Rubric Properties and Associations with School Characteristics: Evidence from the Texas Evaluation System
Authors: Val Lazarev, Thanh Thi Nguyen, Denis Newman, Jenna Lynn Zacamy, Li Lin
In Session: Teacher Evaluation Under the Microscope
Tuesday, April 17 - 12:25 to 1:55pm
New York Marriott Marquis, Seventh Floor, Astor Ballroom

Abstract: A seminal 2009 report, The Widget Effect, alerted the nation to the tendency of traditional teacher evaluation systems to treat teachers like widgets, undifferentiated in their level of effectiveness. Since then, a growing body of research, coupled with new federal initiatives, has catalyzed the reform of such systems. In 2014-15, Texas piloted its reformed evaluation system, collecting classroom observation rubric ratings from over 8,000 teachers across 51 school districts. This study analyzed that large dataset and found that 26.5 percent of teachers were rated below proficient, compared to 2 percent under previous measures. The study also found a promising indication of low bias in the rubric ratings stemming from school characteristics, given that these characteristics were minimally associated with observation ratings.

We look forward to seeing you at our sessions to discuss our research. We’re also co-hosting a cocktail reception with Division H! If you’d like an invite, let us know.

2018-03-06

How Efficacy Studies Can Help Decision-makers Decide if a Product is Likely to Work in Their Schools

We and our colleagues have been working on translating the results of rigorous studies of the impact of educational products, programs, and policies for people in school districts who are deciding whether to purchase, or even just pilot, the product. We are influenced by Stanford University methodologist Lee Cronbach, especially his seminal book (1982) and article (1975), where he concludes: “When we give proper weight to local conditions, any generalization is a working hypothesis, not a conclusion…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (p. 125). In other words, when interpreting the causal effect of a program, we consider even the best designed experiment to be like a case study, as much about the local and moderating role of context as about the treatment itself.

Following the focus on context, we can consider characteristics of the people and of the institution where the experiment was conducted to be co-causes of the result that deserve full attention—even though, technically, only the treatment, which was randomly assigned, was controlled. Here we argue that any generalization from a rigorous study, where the question is whether the product is likely to be worth trying in a new district, must consider the full context of the study.

Technically, in the language of evaluation research, these differences in who or where the product or “treatment” works are called “interaction effects” between the treatment and the characteristic of interest (e.g., subgroups of students by demographic category or achievement level, teachers with different skills, or bandwidth available in the building). The characteristic of interest can be called a “moderator”, since it changes, or moderates, the impact of the treatment. An interaction reveals if there is differential impact and whether a group with a particular characteristic is advantaged, disadvantaged, or unaffected by the product.
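To make the terminology concrete, a differential impact is just the difference between two subgroup-level treatment effects. Here is a minimal Python sketch using hypothetical subgroup means (illustrative numbers only, not data from any study):

```python
# Hypothetical mean test-score gains by condition and subgroup.
means = {
    ("treatment", "english_learner"):    2.0,
    ("control",   "english_learner"):    1.5,
    ("treatment", "english_proficient"): 5.0,
    ("control",   "english_proficient"): 2.0,
}

def subgroup_effect(group):
    """Treatment effect (treatment minus control) within one subgroup."""
    return means[("treatment", group)] - means[("control", group)]

def interaction(group_a, group_b):
    """Differential impact: the gap between two subgroup effects."""
    return subgroup_effect(group_a) - subgroup_effect(group_b)

effect_el = subgroup_effect("english_learner")        # 0.5
effect_ep = subgroup_effect("english_proficient")     # 3.0
differential = interaction("english_proficient", "english_learner")  # 2.5
```

In these made-up numbers the program “works” for both groups, but the interaction shows it works far better for one of them, which is exactly the information an average impact hides.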

The rules set out by the Department of Education’s What Works Clearinghouse (WWC) focus on the validity of the experimental conclusion: did the program work on average compared to a control group? Whether it works better for poor kids than for middle-class kids, works better for uncertified teachers than for veteran teachers, or increases or closes a gap between English learners and those who are proficient is not part of the information provided in their reviews. But these differences are exactly what buyers need in order to understand whether the product is a good candidate for a population like theirs. If a program works substantially better for English proficient students than for English learners, and the purchasing school serves largely the latter, it is important that the school administrator know the context for the research and the result.

The accuracy of generalizing an experimental finding depends on its not being moderated by conditions. This is recognized in recent methods of generalization (Tipton, 2013) that essentially apply non-experimental adjustments to experimental results to make them more accurate and more relevant to specific local contexts.

Work by Jaciw (2016a, 2016b) takes this one step further.

First, he confirms the result that if the impact of the program is moderated, and if moderators are distributed differently between sites, then an experimental result from one site will yield a biased inference for another site. This would be the case, for example, if the impact of a program depends on individual socioeconomic status, and there is a difference between the study and inference sites in the proportion of individuals with low socioeconomic status. Conditions for this “external validity bias” are well understood, but the consequences are addressed much less often than the usual selection bias. Experiments can yield accurate results about the efficacy of a program for the sample studied, but that average may not apply either to a subgroup within the sample or to a population outside the study.
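The arithmetic behind this external validity bias is simple to illustrate. In the toy sketch below (all numbers hypothetical), the subgroup impacts are identical at both sites; only the subgroup mix differs, yet the sites' average impacts diverge.

```python
# Assumed impacts by SES subgroup: the program works better for high-SES
# students. These are illustrative values, not results from any study.
impact = {"low_ses": 1.0, "high_ses": 4.0}

def site_average_impact(p_low):
    """Average impact at a site where p_low is the share of low-SES students."""
    return p_low * impact["low_ses"] + (1.0 - p_low) * impact["high_ses"]

study_site     = site_average_impact(0.2)  # mostly high-SES study sample
inference_site = site_average_impact(0.8)  # mostly low-SES district
bias = study_site - inference_site
# The study's average overstates the impact for the inference site,
# even though both subgroup impacts are the same everywhere.
```

The average from the study site is a poor working hypothesis for the inference site; reporting the subgroup impacts themselves would let the inference site compute its own expected average.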

Second, he uses results from a multisite trial to show empirically that there is potential for significant bias when inferring experimental results from one subset of sites to other inference sites within the study; however, moderators can account for much of the variation in impact across sites. Average impact findings from experiments provide a summary of whether a program works, but leave the consumer guessing about the boundary conditions for that effect—the limits beyond which the average effect ceases to apply. Cronbach was highly aware of this, titling a chapter in his 1982 book “The Limited Reach of Internal Validity”. Using terms like “unbiased” to describe impact findings from experiments is correct in a technical sense (i.e., the point estimate, on hypothetical repeated sampling, is centered on the true average effect for the sample studied), but it can impart an incorrect sense of the external validity of the result: that it applies beyond the instance of the study.

Implications of the work cited are, first, that it is possible to unpack marginal impact estimates through subgroup and moderator analyses to arrive at more accurate inferences for individuals. Second, that we should do so—why obscure differences by paying attention only to the grand mean impact estimate for the sample? And third, that we should be planful in deciding which subgroups to assess impacts for in the context of individual experiments.

Local decision-makers’ primary concern should be with whether a program will work with their specific population, and to ask for causal evidence that considers local conditions through the moderating role of student, teacher, and school attributes. Looking at finer differences in impact may elicit criticism that it introduces another type of uncertainty—specifically from random sampling error—which may be minimal with gross impacts and large samples, but influential when looking at differences in impact with more and smaller samples. This is a fair criticism, but differential effects may be less susceptible to random perturbations (low power) than assumed, especially if subgroups are identified at individual levels in the context of cluster randomized trials (e.g., individual student-level SES, as opposed to school average SES) (Bloom, 2005; Jaciw, Lin, & Ma, 2016).

References:
Bloom, H. S. (2005). Randomizing groups to evaluate place-based programs. In H. S. Bloom (Ed.), Learning more from social experiments. New York: Russell Sage Foundation.

Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116-127.

Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass.

Jaciw, A. P. (2016). Applications of a within-study comparison approach for evaluating bias in generalized causal inferences from comparison group studies. Evaluation Review, 40(3), 241-276. Retrieved from http://erx.sagepub.com/content/40/3/241.abstract

Jaciw, A. P. (2016). Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40(3), 199-240. Retrieved from http://erx.sagepub.com/content/40/3/199.abstract

Jaciw, A., Lin, L., & Ma, B. (2016). An empirical study of design parameters for assessing differential impacts for students in group randomized trials. Evaluation Review. Retrieved from http://erx.sagepub.com/content/early/2016/10/14/0193841X16659600.abstract

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239-266.

2018-01-16

ETIN Releases Guidelines for Research on Educational Technologies in K-12 Schools

The press release (below) was originally published on the SIIA website. It has since inspired stories in the Huffington Post, edscoop, EdWeek’s Market Brief, and the EdSurge newsletter.



ETIN Releases Guidelines for Research on Educational Technologies in K-12 Schools

Changes in education technology and policy spur updated approach to industry research

Washington, DC (July 25, 2017) – The Education Technology Industry Network (ETIN), a division of the Software & Information Industry Association (SIIA), released an important new report today: “Guidelines for Conducting and Reporting EdTech Impact Research in U.S. K-12 Schools.” Authored by Dr. Denis Newman and the research team at Empirical Education Inc., the Guidelines provide 16 best practice standards of research for publishers and developers of educational technologies.

The Guidelines are a response to the changing research methods and policies driven by the accelerating pace of development and passage of the Every Student Succeeds Act (ESSA), which has challenged the static notion of evidence defined in NCLB. Recognizing the need for consensus among edtech providers, customers in the K-12 school market, and policy makers at all levels, SIIA is making these Guidelines freely available.

“SIIA members recognize that changes in technology and policy have made evidence of impact an increasingly critical differentiator in the marketplace,” said Bridget Foster, senior VP and managing director of ETIN. “The Guidelines show how research can be conducted and reported within a short timeframe and still contribute to continuous product improvement.”

“The Guidelines for research on edtech products is consistent with our approach to efficacy: that evidence of impact can lead to product improvement,” said Amar Kumar, senior vice president of Efficacy & Research at Pearson. “We appreciate ETIN’s leadership and Empirical Education’s efforts in putting together this clear presentation of how to use rigorous and relevant research to drive growth in the market.”

The Guidelines draw on over a decade of experience in conducting research in the context of the U.S. Department of Education’s Institute of Education Sciences, and its Investing in Innovation program.

“The current technology and policy environment provides an opportunity to transform how research is done,” said Dr. Newman, CEO of Empirical Education Inc. and lead author of the Guidelines. “Our goal in developing the new guidelines was to clarify current requirements in a way that will help edtech companies provide school districts with the evidence they need to consistently quantify the value of software tools. My thanks go to SIIA and the highly esteemed panel of reviewers whose contribution helped us provide the roadmap for the change that is needed.”

“In light of the ESSA evidence standards and the larger movement toward evidence-based reform, publishers and software developers are increasingly being called upon to show evidence that their products make a difference with children,” said Guidelines peer reviewer Dr. Robert Slavin, director of the Center for Research and Reform in Education, Johns Hopkins University. “The ETIN Guidelines provide practical, sensible guidance to those who are ready to meet these demands.”

ETIN’s goal is to improve the market for edtech products by advocating for greater transparency in reporting research findings. For that reason, it is actively working with government, policy organizations, foundations, and universities to gain the needed consensus for change.

“As digital instructional materials flood the market place, state and local leaders need access to evidence-based research regarding the effectiveness of products and services. This guide is a great step in supporting both the public and private sector to help ensure students and teachers have access to the most effective resources for learning,” stated Christine Fox, Deputy Executive Director, SETDA. The Guidelines can be downloaded here: https://www.empiricaleducation.com/research-guidelines.

2017-07-25

SIIA ETIN EIS Conference Presentations 2017


We are playing a major role in the Education Impact Symposium (EIS), organized by the Education Technology Industry Network (ETIN), a division of The Software & Information Industry Association (SIIA).

  1. ETIN is releasing a set of edtech research guidelines authored this year by CEO Denis Newman
  2. Denis is speaking on two panels

The edtech research guidelines that Denis authored and ETIN is releasing on Tuesday, July 25 are called “Guidelines for Conducting and Reporting EdTech Impact Research in U.S. K-12 Schools” and can be downloaded from this webpage. The Guidelines are a much-needed response to a rapidly-changing environment of cloud-based technology and important policy changes brought about by the Every Student Succeeds Act (ESSA).

The panels Denis will be presenting on are both on Tuesday, July 25, 2017.

12:30 - 1:15pm
ETIN’s New Guidelines for Product Research in the ESSA Era
With the recent release of ETIN’s updated Guidelines for EdTech Impact Research, developers and publishers can ride the wave of change from NCLB’s sluggish concept of “scientifically-based” to ESSA’s dynamic view of “evidence” for continuous improvement. The Guidelines are being made publicly available at the Symposium, with a discussion and Q&A led by the lead author and some of the contributing reviewers.
Moderator:
Myron Cizdyn, Chief Executive Officer, The BLPS Group
Panelists:
Malvika Bhagwat, Research & Efficacy, Newsela
Amar Kumar, Sr. Vice President, Pearson
Denis Newman, CEO, Empirical Education Inc.
John Richards, President, Consulting Services for Education

2:30 - 3:30pm
The Many Faces of Impact
Key stakeholders in the edtech community will each review, TED Talk style, what they are doing to increase the impact of digital products, programs, and services. Our line-up of presenters includes:
- K-12 and HE content providers using impact data to better understand their customers, improve their products, and support their marketing and sales teams
- an investor seeking impact on both disadvantaged populations and their financial return in order to make funding decisions for portfolio companies
- an education organization helping institutions decide what research is useful to them and how to grapple with new ESSA requirements
- a researcher working with product developers to produce evidence of the impact of their digital products

After the set of presenters have finished, we’ll have time for your questions on these multidimensional aspects of IMPACT and how technology can help.
Moderator:
Karen Billings, Principal, BillingsConnects
Panelists:
Jennifer Carolan, General Partner, Reach Capital
Christopher Cummings, VP, Institutional Product and Solution Design, Cengage
Melissa Greene, Director, Strategic Partnerships, SETDA
Denis Newman, CEO, Empirical Education Inc.
Kari Stubbs, PhD, Vice President, Learning & Innovation, BrainPOP

Jennifer Carolan, Denis Newman, and Chris Cummings on a panel at ETIN EIS

If you get a chance to check out the Guidelines before EIS, Denis would love to hear your thoughts about them at the conference.

2017-07-21