Blog Posts and News Stories

Agile Assessment and the Impact of Formative Testing on Student Achievement in Algebra

Empirical Education contracted with Jefferson Education Accelerator (JEA) to conduct a study on the effectiveness of formative testing for improving student achievement in Algebra. We partnered with a large urban school district in the northeast U.S. to evaluate its use of Agile Assessment. Developed by experts at the Charles A. Dana Center at the University of Texas and education company Agile Mind, Agile Assessment is a flexible system for developing, administering, and analyzing student assessments that are aligned by standard, reading level, and level of difficulty. The district used benchmark Agile Assessments in the fall, winter, and spring to assess student performance in Algebra, alongside a curriculum it had chosen independently of the assessments.

We conducted a quasi-experimental comparison group study using data from the 2016-17 school year and examined the impact of Agile Assessment usage on student achievement for roughly 1,000 students using the state standardized assessment in Algebra.

There were three main findings from the study:

  1. Algebra scores for students who used Agile Assessment were better than scores of comparison students. The result had an effect size of 0.30 (p = .01), which corresponds to a 12-percentile-point gain, adjusting for differences in student demographics and pretest scores between treatment and comparison students (the arithmetic behind the percentile conversion is sketched after this list).
  2. The positive impact of Agile Assessment generalized across many student subgroups, including Hispanic students, economically disadvantaged students, and special education students.
  3. Outcomes on the state Algebra assessment were positively associated with the average score on the Agile Assessment benchmark tests. However, adding the average benchmark score to the linear model increased its predictive power only by a small amount.
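
For readers who want to check the conversion from an effect size to a percentile-point gain, here is a minimal sketch of the arithmetic under a normality assumption; the 0.30 value is the effect size reported above, and everything else is standard.

```python
from scipy.stats import norm

# Under a normality assumption, an effect size (standardized mean difference)
# can be read as the expected percentile rank of an average comparison-group
# student if that student received the treatment effect.
effect_size = 0.30  # standardized effect reported in the study

expected_percentile = norm.cdf(effect_size) * 100  # ~61.8th percentile
gain = expected_percentile - 50                    # gain relative to the median

print(f"Expected percentile: {expected_percentile:.1f}")
print(f"Percentile-point gain: {gain:.1f}")  # ~11.8, i.e., about 12 points
```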

These findings provide valuable evidence in favor of formative testing for the district and other stakeholders. Given disruptions to the current public school paradigm, more frequent formative assessment could give educators the visibility needed for more personalized instruction and ultimately improve student outcomes. You can read the full research report here.

2020-06-17

AERA 2018 Recap: The Possibilities and Necessity of a Rigorous Education Research Community

This year’s AERA annual meeting on “The Dreams, Possibilities, and Necessity of Public Education,” was fittingly held in the city with the largest number of public school students in the country—New York. Against this radically diverse backdrop, presenters were encouraged to diversify both the format and topics of presentations in order to inspire thinking and “confront the struggles for public education.”

AERA’s sheer size risks overwhelming its attendees, but in other ways it came as a relief. At a time when educators and education remain under-resourced, it was heartening to be reminded that a large, vibrant community of dedicated and intelligent people exists to improve educational opportunities for all students.

One theme that particularly stood out is that researchers are finding increasingly creative ways to use existing usage data from education technology products to measure impact and implementation. This is a good thing when it comes to reducing the cost of research and making it more accessible to smaller businesses and nonprofits. For example, in a presentation on a software-based knowledge competition for nursing students, researchers used usage data to identify components of player styles and determine whether these styles had a significant effect on student performance. In our Edtech Research Guidelines, Empirical similarly recommends that edtech companies take advantage of their existing usage data to run impact and implementation analyses, without using more expensive data collection methods. This can help significantly reduce the cost of research studies—rather than one study that costs $3 million, companies can consider multiple lower-cost studies that leverage usage data and give the company a picture of how the product performs in a greater diversity of contexts.
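
As a hedged illustration of the kind of analysis described above, the sketch below derives usage "styles" from product log features and tests whether style membership is associated with student performance. The file name, feature columns, and the choice of k-means with three clusters are all hypothetical, not details from the study presented at AERA.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import statsmodels.formula.api as smf

# Hypothetical per-student summary of product usage logs plus an outcome score.
# Expected columns: sessions, minutes_per_session, hints_used, items_attempted, score
df = pd.read_csv("usage_log_summary.csv")

features = ["sessions", "minutes_per_session", "hints_used", "items_attempted"]
X = StandardScaler().fit_transform(df[features])

# Group students into usage "styles", then test whether style predicts scores.
df["style"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
model = smf.ols("score ~ C(style)", data=df).fit()
print(model.summary())
```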

Empirical staff themselves presented on a variety of topics, including quasi-experiments on edtech products; teacher recruitment, evaluation, and retention; and long-term impact evaluations. In all cases, Empirical reinforced its commitment to innovative, low-cost, and rigorous research. You can read more about the research projects we presented in our previous AERA post.

[Photo: Denis Newman presenting at AERA 2018]

Finally, Empirical was delighted to co-host the Division H AERA Reception at the Supernova bar at the Novotel Hotel. If you ever wondered whether Empirical knows how to throw a party, wonder no more! A few pictures from the event are below. View all of the pictures from our event on Facebook!


We had a great time and look forward to seeing everyone at the next AERA annual meeting!

2018-05-03

Join Our Webinar: Measuring Ed Tech Impact in the ESSA Era

Tuesday, November 7, 2017, 2:00-3:00pm PT

Our CEO, Denis Newman, will be collaborating with Andrew Coulson (Chief Strategist, MIND Research Institute) and Bridget Foster (Senior VP and Managing Director, SIIA) to bring you an informative webinar next month!

This free webinar (co-hosted by edWeb.net and MCH Strategic Data) will introduce you to a new approach to evidence about which edtech products really work in K-12 schools. ESSA has changed the game when it comes to what counts as evidence. This webinar builds on the Education Technology Industry Network’s (ETIN) recent publication of the Guidelines for EdTech Impact Research, which explains the new ground rules.

The presentation will explore how we can improve the conversation between edtech developers and vendors (providers), and the school district decision makers who are buying and/or piloting the products (buyers). ESSA has provided a more user-friendly definition of evidence, which facilitates the conversation.

  • Many buyers are asking providers if there’s reason to think their product is likely to work in a district like theirs.
  • For providers, the new ESSA rules let them start with simple studies that demonstrate their product shows promise, without having to invest in expensive trials to prove it will work everywhere.

The presentation brings together two experts: Andrew Coulson, a developer who has conducted research on his organization’s products and is concerned with improving the efficacy of edtech, and Denis Newman, a researcher who is the lead author of the ETIN Guidelines. The presentation will be moderated by Bridget Foster, a long-time educator who now directs the ETIN at SIIA. This edWebinar will be of interest to edtech developers, school and district administrators, education policy makers, association leaders, and any educator interested in the evidence of efficacy in edtech.

If you would like to attend, click here to register.

2017-09-28

Presenting at AERA 2017

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in San Antonio, TX from April 27 – 30, 2017.

Research Presentations will include the following.

Increasing Accessibility of Professional Development (PD): Evaluation of an Online PD for High School Science Teachers
Authors: Adam Schellinger, Andrew P Jaciw, Jenna Lynn Zacamy, Megan Toby, & Li Lin
In Event: Promoting and Measuring STEM Learning
Saturday, April 29 10:35am to 12:05pm
Henry B. Gonzalez Convention Center, River Level, Room 7C

Abstract: This study examines the impact of an online teacher professional development program focused on academic literacy in high school science classes. A one-year randomized controlled trial measured the impact of Internet-Based Reading Apprenticeship Improving Science Education (iRAISE) on instructional practices and student literacy achievement in 27 schools in Michigan and Pennsylvania. Researchers found a differential impact of iRAISE favoring students with lower incoming achievement (although there was no overall impact of iRAISE on student achievement). Additionally, there were positive impacts on several instructional practices. These findings are consistent with the specific goals of iRAISE: to provide high-quality, accessible online training that improves science teaching. The authors compare these results to previous evaluations of the same intervention delivered in a face-to-face format.


How Teacher Practices Illuminate Differences in Program Impact in Biology and Humanities Classrooms
Authors: Denis Newman, Val Lazarev, Andrew P Jaciw, & Li Lin
In Event: Poster Session 5 - Program Evaluation With a Purpose: Creating Equal Opportunities for Learning in Schools
Friday, April 28 12:25 to 1:55pm
Henry B. Gonzalez Convention Center, Street Level, Stars at Night Ballroom 4

Abstract: This paper reports research to explain the positive impact found in a major RCT for students in the classrooms of a subgroup of teachers. Our goal was to understand why there was an impact for science teachers but not for teachers of humanities, i.e., history and English. We have labelled our analysis “moderated mediation” because we start with the finding that the program’s success was moderated by the subject taught by the teacher and then go on to look at the differences in mediation processes depending on the subject being taught. We find that the program’s impact on teacher practices differs by mediator (as measured in surveys and observations) and that the mediators are differentially associated with impact on students depending on the context.


Are Large-Scale Randomized Controlled Trials Useful for Understanding the Process of Scaling Up?
Authors: Denis Newman, Val Lazarev, Jenna Lynn Zacamy, & Li Lin
In Event: Poster Session 3 - Applied Research in School: Education Policy and School Context
Thursday, April 27 4:05 to 5:35pm
Henry B. Gonzalez Convention Center, Ballroom Level, Hemisfair Ballroom 2

Abstract: This paper reports on a large-scale program evaluation that included an RCT and a parallel study of 167 schools outside the RCT, which provided an opportunity to study the growth of a program and to compare the two contexts. Teachers in both contexts were surveyed, and a large subset of the questions was asked of both scale-up teachers and teachers in the treatment schools of the RCT. We find large differences in the level of commitment to program success in the scale-up schools; far less was found in the RCT, suggesting that a large-scale RCT may not capture the processes at play in the scale-up of a program.

We look forward to seeing you at our sessions to discuss our research. You can also view our presentation schedule here.

2017-04-17

Unintended Consequences of Using Student Test Scores to Evaluate Teachers

There has been a powerful misconception driving policy in education. It’s a case where theory was inappropriately applied to practice. The misconception has had unintended consequences. It is helping to lead large numbers of parents to opt out of testing and could very well weaken the case in Congress for accountability as ESEA is reauthorized.

The idea that we can use student test scores as one of the measures in evaluating teachers came into vogue with Race to the Top. As a result of that and related federal policies, 38 states now include measures of student growth in teacher evaluations.

This was a conceptual advance over the NCLB definition of teacher quality in terms of preparation and experience. The focus on test scores was also a brilliant political move. The simple qualification for funding from Race to the Top—a linkage between teacher and student data—moved state legislatures to adopt policies calling for more rigorous teacher evaluations even without funding states to implement the policies. The simplicity of pointing to student achievement as the benchmark for evaluating teachers seemed incontrovertible.

It also had a scientific pedigree. Solid work had been accomplished by economists developing value-added modeling (VAM) to estimate a teacher’s contribution to student achievement. Hanushek et al.’s analysis is often cited as the basis for the now widely accepted view that teachers make the single largest contribution to student growth. The Bill and Melinda Gates Foundation invested heavily in its Measures of Effective Teaching (MET) project, which put the econometric calculation of teachers’ contribution to student achievement at the center of multiple measures.

The academic debates around VAM remain intense concerning the most productive statistical specification and evidence for causal inferences. Perhaps the most exciting area of research is in analyses of longitudinal datasets showing that students who have teachers with high VAM scores continue to benefit even into adulthood and career—not so much in their test scores as in their higher earnings, lower likelihood of having children as teenagers, and other results. With so much solid scientific work going on, what is the problem with applying theory to practice? While work on VAMs has provided important findings and productive research techniques, there are four important problems in applying these scientifically-based techniques to teacher evaluation.
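
As a rough illustration of what a value-added model estimates (a minimal sketch, not the specification used in the studies cited here), the code below regresses current-year scores on prior-year scores plus teacher indicators; the coefficients on the teacher indicators serve as crude value-added estimates. The data are simulated and all names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: each row is a student with a prior-year score, a current-year
# score, and a teacher assignment. Real VAMs add many more controls
# (demographics, multiple prior scores, shrinkage toward the mean, etc.).
rng = np.random.default_rng(0)
n_teachers, class_size = 20, 25
teacher = np.repeat(np.arange(n_teachers), class_size)
true_teacher_effect = rng.normal(0, 0.2, n_teachers)
prior = rng.normal(0, 1, n_teachers * class_size)
current = 0.7 * prior + true_teacher_effect[teacher] + rng.normal(0, 0.5, len(prior))

df = pd.DataFrame({"current": current, "prior": prior, "teacher": teacher})

# Teacher fixed effects net of prior achievement: the estimated coefficients on
# the teacher indicators are simple (unshrunken) value-added estimates.
model = smf.ols("current ~ prior + C(teacher)", data=df).fit()
vam_estimates = model.params.filter(like="C(teacher)")
print(vam_estimates.sort_values().round(2))
```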

First, and this is the thing that should have been obvious from the start, most teachers teach in grades or subjects where no standardized tests are given. If you’re conducting research, there is a wealth of data for math and reading in grades three through eight. However, if you’re a middle-school principal and there are standardized tests for only 20% of your teachers, you will have a problem using test scores for evaluation.

Nevertheless, federal policy required states—in order to receive a waiver from some of the requirements of NCLB—to institute teacher evaluation systems that use student growth as a major factor. To fill the gap in test scores, a few districts purchased or developed tests for every subject taught. A more widespread practice is the use of Student Learning Objectives (SLOs). Unfortunately, while they may provide an excellent process for reflection and goal setting between the principal and teacher, they lack the psychometric properties of VAMs, which allow administrators to objectively rank a teacher in relation to other teachers in the district. As the Mathematica team observed, “SLOs are designed to vary not only by grade and subject but also across teachers within a grade and subject.” By contrast, academic research on VAM gave educators and policy makers the impression that a single measure of student growth could be used for teacher evaluation across grades and subjects. It was a misconception unfortunately promoted by many VAM researchers who may have been unaware that the technique could only be applied to a small portion of teachers.

There are several additional reasons that test scores are not useful for teacher evaluation.

The second reason is that VAMs or other measures of student growth don’t provide any indication as to how a teacher can improve. If the purpose of teacher evaluation is to inform personnel decisions such as terminations, salary increases, or bonuses, then, at least for reading and math teachers, VAM scores would be useful. But we are seeing a widespread orientation toward using evaluations to inform professional development. Other kinds of measures, most obviously classroom observations conducted by a mentor or administrator—combined with feedback and guidance—provide a more direct mapping to where the teacher needs to improve. The observer-teacher interactions within an established framework also provide an appropriate managerial discretion in translating the evaluation into personnel decisions. Observation frameworks not only break the observation into specific aspects of practice but provide a rubric for scoring in four or five defined levels. A teacher can view the training materials used to calibrate evaluators to see what the next level looks like. VAM scores are opaque in contrast.

Third, test scores are associated with a narrow range of classroom practice. My colleague, Val Lazarev, and I found an interesting result from a factor analysis of the data collected in the MET project. MET collected classroom videos from thousands of teachers, which were then coded using a number of frameworks. The students were tested in reading and/or math using an assessment that was more focused on problem-solving and constructed-response items than is found in the usual state test. Our analysis showed that a teacher’s VAM score is more closely associated with the framework elements related to classroom and behavior management (i.e., keeping order in the classroom) than the more refined aspects of dialog with students. Keeping the classroom under control is a fundamental ability associated with good teaching but does not completely encompass what evaluators are looking for. Test scores, as the benchmark measure for effective teaching, may not be capturing many important elements.
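
The analysis described in this paragraph can be sketched roughly as follows: reduce many observation-rubric items to a small number of factors, then check how strongly each factor correlates with teachers' VAM scores. The data, item names, and two-factor choice below are simulated stand-ins, not the MET data or the model we actually used.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Simulated teacher-level data: observation-rubric item scores plus a VAM score.
rng = np.random.default_rng(1)
n = 200
management = rng.normal(0, 1, n)   # latent "classroom management" dimension
dialogue = rng.normal(0, 1, n)     # latent "instructional dialogue" dimension
items = pd.DataFrame({
    "behavior_management": management + rng.normal(0, 0.5, n),
    "time_on_task":        management + rng.normal(0, 0.5, n),
    "questioning":         dialogue + rng.normal(0, 0.5, n),
    "student_discussion":  dialogue + rng.normal(0, 0.5, n),
})
vam_score = 0.5 * management + 0.2 * dialogue + rng.normal(0, 0.8, n)

# Extract two factors from the rubric items, then correlate each factor with
# the VAM score (factor signs are arbitrary, so interpret the magnitude of r).
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(items)
for i in range(factors.shape[1]):
    r = np.corrcoef(factors[:, i], vam_score)[0, 1]
    print(f"Factor {i + 1} vs. VAM score: r = {r:.2f}")
```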

Fourth, achievement test scores (and associated VAMs) are calculated based on what teachers can accomplish with respect to improving test scores from the time students appear in their classes in the fall to when they take the standardized test in the spring. If you ask people about their most influential teacher, they talk about being inspired to take up a particular career or about keeping them in school. These are results that are revealed in following years or even decades. A teacher who gets a student to start seeing math in a new way may not get immediate results on the spring test but may get the student to enroll in a more challenging course the next year. A teacher who makes a student feel at home in class may be an important part of the student not dropping out two years later. Whether or not teachers can cause these results is speculative. But the characteristics of warm, engaging, and inspiring teaching can be observed. We now have analytic tools and longitudinal datasets that can begin to reveal the association between being in a teacher’s class and the probability of a student graduating, getting into college, and pursuing a productive career. With records of systematic classroom observations, we may be able, in the future, to associate teaching practices with benchmarks that are more meaningful than the spring test score.

The policy-makers’ dream of an algorithm for translating test scores into teacher salary levels is a fallacy. Even the weaker provisions, such as the vague requirement that student growth be an important element among multiple measures in teacher evaluations, have led to a profusion of methods of questionable utility for setting individual goals for teachers. But the insistence on using annual student achievement as the benchmark has led to more serious, perhaps unintended, consequences.

Teacher unions have had good reason to object to using test scores for evaluations. Teacher opposition to this misuse of test scores has reinforced a negative perception of tests as something that teachers oppose in general. The introduction of the new Common Core tests might have been welcomed by the teaching profession as a stronger alignment of the test with the widely shared belief about what is important for students to learn. But the change was opposed by the profession largely because it would be unfair to evaluate teachers on the basis of a test they had no experience preparing students for. Reducing the teaching profession’s opposition to testing may help reduce the clamor of the opt-out movement and keep the schools on the path of continuous improvement of student assessment.

We can return to recognizing that testing has value for teachers as formative assessment. For the larger community, it has value as assurance that schools and districts are maintaining standards and, most importantly as Congress considers the reauthorization of NCLB, that they are not failing to educate the subgroups of students who have the greatest need.

A final note. For purposes of program and policy evaluation, for understanding the elements of effective teaching, and for longitudinal tracking of the effect on students of school experiences, standardized testing is essential. Research on value-added modeling must continue and expand beyond tests to measure the effect of teachers on preparing students for “college and career”. Removing individual teacher evaluation from the equation will be a positive step toward having the data needed for evidence-based decisions.

An abbreviated version of this blog post can be found on Real Clear Education.

2015-09-10

District Data Study: Empirical’s Newest Research Product

Empirical Education introduces its newest offering: District Data Study™. Aimed at providing evidence of effectiveness, District Data Study assists vendors in conducting quantitative case studies using historical data from schools and districts currently engaged in a specific educational program.

There are two basic questions that can be cost-effectively answered given the available data.

  1. Are the outcomes (behavioral or academic) for students in schools that use the program better than outcomes of comparable students in schools not (or before) using the program?
  2. Is the amount of program usage associated with differences in outcomes?

The data studies result in concise reports on measurable academic and behavioral outcomes using appropriate statistical analyses of customer data from implementation of the educational product or program. District Data Study is built on efficient procedures and engineering infrastructure that can be applied to individual districts already piloting a program or veteran clients with longstanding implementation.
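
A minimal sketch of how the second question above (is the amount of usage associated with differences in outcomes?) might be answered with a district's historical data, assuming an export with per-student usage minutes, a prior-year score, and an end-of-year outcome; the file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical district export: one row per student, with program usage,
# a prior-year score as a covariate, and the end-of-year outcome.
df = pd.read_csv("district_export.csv")  # columns: usage_minutes, pretest, posttest

# Association between amount of usage and outcomes, adjusting for the pretest.
# A positive, significant usage coefficient suggests a dose-response
# relationship; it is evidence of association, not proof of causation.
model = smf.ols("posttest ~ usage_minutes + pretest", data=df).fit()
print(model.summary().tables[1])
```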

2011-11-20

Empirical Presents at AERA 2012

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in Vancouver, Canada from April 13 – 17, 2012. Our presentations will span two divisions: 1) Measurement and Research Methodology and 2) Research, Evaluation and Assessment in Schools.

Research Topics will include:

Current Studies in Program Evaluation to Improve Student Achievement Outcomes

Evaluating Alabama’s Math, Science and Technology Initiative: Results of a Three-Year, State-Wide Randomized Experiment

Accommodating Data From Quasi-Experimental Design

Quantitative Approaches to the Evaluation of Literacy Programs and Instruction for Elementary and Secondary Students

We look forward to seeing you at our sessions to discuss our research. You can also download our presentation schedule here. As has become tradition, we plan to host yet another of our popular AERA receptions. Details about the reception will follow in the months to come.

2011-11-18

Empirical's Chief Scientist co-authored a recently released NCEE Reference Report

Together with researchers from Abt Associates, Andrew Jaciw, Chief Scientist of Empirical Education, co-authored a recently released report entitled “Estimating the Impacts of Educational Interventions Using State Tests or Study-Administered Tests”. The full report, released by the National Center for Education Evaluation and Regional Assistance (NCEE), can be found on the Institute of Education Sciences (IES) website. The NCEE Reference Report examines and identifies factors that could affect the precision of program evaluations when they are based on state assessments instead of study-administered tests. The authors found that:

  • using the same test for both the pre- and post-test yielded more precise impact estimates;
  • using two pre-test covariates, one from each type of test (state assessment and study-administered standardized test), yielded more precise impact estimates; and
  • using as the dependent variable the simple average of the post-test scores from the two types of tests yielded more precise impact estimates and smaller sample size requirements than using post-test scores from only one of the two types of tests.
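
To make the precision findings concrete, here is a small simulation (my own sketch under simple assumptions, not the report's method or data) showing how adding a pre-test covariate shrinks the standard error of an estimated treatment impact.

```python
import numpy as np
import statsmodels.api as sm

# Simulated trial: a pre-test correlated with the post-test and a treatment
# with a known effect. Compare the standard error of the impact estimate
# with and without the pre-test covariate.
rng = np.random.default_rng(2)
n = 1000
treat = rng.integers(0, 2, n).astype(float)
pretest = rng.normal(0, 1, n)
posttest = 0.2 * treat + 0.8 * pretest + rng.normal(0, 0.6, n)

# No covariate: impact estimated from the treatment indicator alone.
se_without = sm.OLS(posttest, sm.add_constant(treat)).fit().bse[1]

# With the pre-test covariate: residual variance drops, so the impact
# estimate is more precise (smaller standard error).
X = sm.add_constant(np.column_stack([treat, pretest]))
se_with = sm.OLS(posttest, X).fit().bse[1]

print(f"SE of impact without pretest covariate: {se_without:.3f}")
print(f"SE of impact with pretest covariate:    {se_with:.3f}")
```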

2011-11-02

Recognizing Success

When the Obama-Duncan administration approaches teacher evaluation, the emphasis is on recognizing success. We heard that clearly in Arne Duncan’s comments on the release of teacher value-added modeling (VAM) data for LA Unified by the LA Times. He’s quoted as saying, “What’s there to hide? In education, we’ve been scared to talk about success.” Since VAM is often thought of as a method for weeding out low performing teachers, Duncan’s statement referencing success casts the use of VAM in a more positive light. Therefore we want to raise the issue here: how do you know when you’ve found success? The general belief is that you’ll recognize it when you see it. But sorting through a multitude of variables is not a straightforward process, and that’s where research methods and statistical techniques can be useful. Below we illustrate how this plays out in teacher and in program evaluation.

As we report in our news story, Empirical is participating in the Gates Foundation project called Measures of Effective Teaching (MET). This project is known for its focus on value-added modeling (VAM) of teacher effectiveness. It is also known for having collected over 10,000 videos from over 2,500 teachers’ classrooms—an astounding accomplishment. Research partners from many top institutions hope to be able to identify the observable correlates for teachers whose students perform at high levels as well as for teachers whose students do not. (The MET project tested all the students with an “alternative assessment” in addition to using the conventional state achievement tests.) With this massive sample that includes both data about the students and videos of teachers, researchers can identify classroom practices that are consistently associated with student success. Empirical’s role in MET is to build a web-based tool that enables school system decision-makers to make use of the data to improve their own teacher evaluation processes. Thus they will be able to build on what’s been learned when conducting their own mini-studies aimed at improving their local observational evaluation methods.

When the MET project recently had its “leads” meeting in Washington DC, the assembled group of researchers, developers, school administrators, and union leaders were treated to an after-dinner speech and Q&A by Joanne Weiss. Joanne is now Arne Duncan’s chief of staff, after having directed the Race to the Top program (and before that was involved in many Silicon Valley educational innovations). The approach of the current administration to teacher evaluation—emphasizing that it is about recognizing success—carries over into program evaluation. This attitude was clear in Joanne’s presentation, in which she declared an intention to “shine a light on what is working.” The approach is part of their thinking about the reauthorization of ESEA, where more flexibility is given to local decision-makers to develop solutions, while the federal legislation is more about establishing achievement goals such as being the leader in college graduation.

Hand in hand with providing flexibility to find solutions, Joanne also spoke of the need to build “local capacity to identify and scale up effective programs.” We welcome the idea that school districts will be free to try out good ideas and identify those that work. This kind of cycle of continuous improvement is very different from the idea, incorporated in NCLB, that researchers will determine what works and disseminate these facts to the practitioners. Joanne spoke about continuous improvement, in the context of teachers and principals, where on a small scale it may be possible to recognize successful teachers and programs without research methodologies. While a teacher’s perception of student progress in the classroom may be aided by regular assessments, the determination of success seldom calls for research design. We advocate for a broader scope, and maintain that a cycle of continuous improvement is just as much needed at the district and state levels. At those levels, we are talking about identifying successful schools or successful programs where research and statistical techniques are needed to direct the light onto what is working. Building research capacity at the district and state level will be a necessary accompaniment to any plan to highlight successes. And, of course, research can’t be motivated purely by the desire to document the success of a program. We have to be equally willing to recognize failure. The administration will have to take seriously the local capacity building to achieve the hoped-for identification and scaling up of successful programs.

2010-11-18

Empirical Education at AERA 2011

Empirical is excited to announce that we will again have a strong showing at the 2011 American Educational Research Association (AERA) Conference. Join us in festive New Orleans, LA, April 8-12 for the final results on the efficacy of the PCI Reading Program, our findings from the first year of formative research on our MeasureResults program evaluation tool, and more. Visit our website in the coming months to view our AERA presentation schedule and details about our annual reception—we hope to see you there!

2010-11-15