blog posts and news stories

Five-year evaluation of Reading Apprenticeship i3 implementation reported at SREE

Empirical Education has released two research reports on the scale-up and impact of Reading Apprenticeship, as implemented under one of the first cohorts of Investing in Innovation (i3) grants. The Reading Apprenticeship Improving Secondary Education (RAISE) project reached approximately 2,800 teachers in five states with a program providing teacher professional development in content literacy in three disciplines: science, history, and English language arts. RAISE supported Empirical Education and our partner, IMPAQ International, in evaluating the innovation through both a randomized control trial encompassing 42 schools and a systematic study of the scale-up of 239 schools. The RCT found significant impact on student achievement in science classes consistent with prior studies. Mean impact across subjects, while positive, did not reach the .05 level of significance. The scale-up study found evidence that the strategy of building cross-disciplinary teacher teams within the school is associated with growth and sustainability of the program. Both sides of the evaluation were presented at the annual conference of the Society for Research on Educational Effectiveness, March 6-8, 2016 in Washington DC. Cheri Fancsali (formerly of IMPAQ, now at Research Alliance for NYC Schools) presented results of the RCT. Denis Newman (Empirical) presented a comparison of RAISE as instantiated in the RCT and scale-up contexts.

You can access the reports and research summaries from the studies using the links below.
RAISE RCT research report
RAISE RCT research summary
RAISE Scale-up research report
RAISE Scale-up research summary


Evaluation Concludes Aspire’s PD Tools Show Promise to Impact Classroom Practice

Empirical Education Inc. has completed an independent evaluation (read the report here) of a set of tools and professional development opportunities developed and implemented by Aspire Public Schools under an Investing in Innovation (i3) grant. Aspire was awarded the development grant in the 2011 funding cycle and put the system, Transforming Teacher Talent (t3), into operation in 2013 in their 35 California schools. The goal of t3 was to improve teacher practice as measured by the Aspire Instructional Rubric (AIR) and thereby improve student outcomes on the California Standards Test (CST), the state assessment. Some of the t3 components connected the AIR scores from classroom observations to individualized professional development materials building on tools from BloomBoard, Inc.

To evaluate t3, Empirical principal investigator Andrew Jaciw and his team designed the strongest feasible evaluation. Since it was not possible to split the schools into two groups by having two versions of Aspire’s technology infrastructure supporting t3, a randomized experiment or other comparison group design was not feasible. Working with the National Evaluation of i3 (NEi3) team, Empirical developed a correlational design comparing two years of teacher AIR scores and student CST scores; that is, comparing scores from the 2012-13 school year with scores from the first year of implementation, 2013-14. Because the state was in a transition to new Common Core tests, the evaluation was unable to collect student outcomes systematically. The AIR scores, however, provided evidence of substantial overall improvement with an effect size of 0.581 standard deviations (p < .001). The evidence meets the standards for “evidence-based” as defined in the recently enacted Every Student Succeeds Act (ESSA), which requires, at the least, that the test of the intervention “demonstrates a statistically significant effect on improving…relevant outcomes based on…promising evidence from at least 1 well designed and well-implemented correlational study with statistical controls for selection bias.” A demonstration of promise can assist in obtaining federal and other funding.


SREE Spring 2016 Conference Presentations

We are excited to be presenting two topics at the annual Spring Conference of The Society for Research on Educational Effectiveness (SREE) next week. Our first presentation addresses the problem of using multiple pieces of evidence to support decisions. Our second presentation compares the context of an RCT with schools implementing the same program without those constraints. If you’re at SREE, we hope to run into you, either at one of these presentations (details below) or at one of yours.

Friday, March 4, 2016 from 3:30 - 5PM
Roosevelt (“TR”) - Ritz-Carlton Hotel, Ballroom Level

6E. Evaluating Educational Policies and Programs
Evidence-Based Decision-Making and Continuous Improvement

Chair: Robin Wisniewski, RTI International

Does “What Works”, Work for Me?: Translating Causal Impact Findings from Multiple RCTs of a Program to Support Decision-Making
Andrew P. Jaciw, Denis Newman, Val Lazarev, & Boya Ma, Empirical Education

Saturday, March 5, 2016 from 10AM - 12PM
Culpeper - Fairmont Hotel, Ballroom Level

Session 8F: Evaluating Educational Policies and Programs & International Perspectives on Educational Effectiveness
The Challenge of Scale: Evidence from Charters, Vouchers, and i3

Chair: Ash Vasudeva, Bill & Melinda Gates Foundation

Comparing a Program Implemented under the Constraints of an RCT and in the Wild
Denis Newman, Valeriy Lazarev, & Jenna Zacamy, Empirical Education


Learning Forward Presentation Highlights Fort Wayne Partnership

This past December, Teacher Evaluation Specialist K.C. MacQueen presented at the annual Learning Forward conference. MacQueen presented alongside Fort Wayne Community Schools’ (FWCS) Todd Cummings and Laura Cain, and Learning Forward’s Kay Psencik. The presentation, titled “Implementing Inter-Rater Reliability in a Learning System,” highlighted how FWCS has used Calibration & Certification Engine (CCE), School Improvement Network’s branded version of Observation Engine™, to ensure equitable evaluation of teacher effectiveness. FWCS detailed the process they used to engage instructional leaders in developing a common rubric vocabulary around their existing teacher observation rubric. While this was an uncommon step and one that definitely added to the implementation timeline, FWCS prioritized this collaboration and found that it increased both inter-rater reliability and buy-in to the process, with the ultimate goal of assisting teachers in improving classroom instruction to result in greater student growth.


Feds Moving Toward a More Rational and Flexible Approach to Teacher Support and Evaluation

Congress is finally making progress on a bill to replace NCLB. Here’s an excerpt from a summary of the draft law.

Helps states support teachers– The bill provides resources to states and school districts to implement various activities to support teachers, principals, and other educators, including allowable uses of funds for high quality induction programs for new teachers, ongoing professional development opportunities for teachers, and programs to recruit new educators to the profession. Ends federal mandates on evaluations, allows states to innovate- The bill allows, but does not require, states to develop and implement teacher evaluation systems. This bill eliminates the definition of a highly qualified teacher—which has proven onerous to states and school districts—and provides states with the opportunity to define this term.

This is very positive. It makes teacher evaluation no longer an Obama-imposed requirement but allows states that want to do it (and there are quite a few of those) to use federal funds to support it. It removes the irrational requirement that “student growth” be a major component of these systems. This will lower the reflexive resistance from unions because the purpose of evaluation can be more clearly associated with teacher support (for more on that argument, see the Real Clear Education piece). It will also encourage the use of observation and feedback from administrators and mentors. Removing the outmoded definition of “highly qualified teacher” opens up the possibility of wider use of research-based analyses of what is important to measure in effective teaching.

A summary is also provided by EdWeek. On a separate note, it says: “That new research and innovation program that some folks were describing as sort of a next generation ‘Investing in Innovation’ program made it into the bill. (Sens. Orrin Hatch, R-Utah, and Michael Bennet, D-Colo., are big fans, as is the administration.)”


Upcoming REL-SW Workshop Event

On November 19th, Erica Plut and Jenna Zacamy will join REL Southwest Alliance Liaison Haidee Williams in facilitating a workshop on Identifying Practices to Engage Native American Indian Families in Students’ Academic and Career Aspirations. The workshop is being offered to the Oklahoma Rural School Research Alliance members and their colleagues and will take place in Norman, Oklahoma. The goals of the workshop are:

  1. To increase alliance members’ knowledge and understanding of the research literature addressing promising practices to engage Native American Indian families in students’ academic and career aspirations
  2. To provide an opportunity to use the research literature to inform the refinement or development of family and community engagement programs or initiatives that are focused on students’ academic and career aspirations

You can find more information about this event on the IES website.


Unintended Consequences of Using Student Test Scores to Evaluate Teachers

There has been a powerful misconception driving policy in education. It’s a case where theory was inappropriately applied to practice. The misconception has had unintended consequences. It is helping to lead large numbers of parents to opt out of testing and could very well weaken the case in Congress for accountability as ESEA is reauthorized.

The idea that we can use student test scores as one of the measures in evaluating teachers came into vogue with Race to the Top. As a result of that and related federal policies, 38 states now include measures of student growth in teacher evaluations.

This was a conceptual advance over the NCLB definition of teacher quality in terms of preparation and experience. The focus on test scores was also a brilliant political move. The simple qualification for funding from Race to the Top—a linkage between teacher and student data—moved state legislatures to adopt policies calling for more rigorous teacher evaluations even without funding states to implement the policies. The simplicity of pointing to student achievement as the benchmark for evaluating teachers seemed incontrovertible.

It also had a scientific pedigree. Solid work had been accomplished by economists developing value-added modeling (VAM) to estimate a teacher’s contribution to student achievement. Hanushek et al.’s analysis is often cited as the basis for the now widely accepted view that teachers make the single largest contribution to student growth. The Bill and Melinda Gates Foundation invested heavily in its Measures of Effective Teaching (MET) project, which put the econometric calculation of teachers’ contribution to student achievement at the center of multiple measures.

The academic debates around VAM remain intense concerning the most productive statistical specification and evidence for causal inferences. Perhaps the most exciting area of research is in analyses of longitudinal datasets showing that students who have teachers with high VAM scores continue to benefit even into adulthood and career—not so much in their test scores as in their higher earnings, lower likelihood of having children as teenagers, and other results. With so much solid scientific work going on, what is the problem with applying theory to practice? While work on VAMs has provided important findings and productive research techniques, there are four important problems in applying these scientifically-based techniques to teacher evaluation.

First, and this is the thing that should have been obvious from the start, most teachers teach in grades or subjects where no standardized tests are given. If you’re conducting research, there is a wealth of data for math and reading in grades three through eight. However, if you’re a middle-school principal and there are standardized tests for only 20% of your teachers, you will have a problem using test scores for evaluation.

Nevertheless, federal policy required states, in order to receive a waiver from some of the requirements of NCLB, to institute teacher evaluation systems that use student growth as a major factor. To fill the gap in test scores, a few districts purchased or developed tests for every subject taught. A more widespread practice is the use of Student Learning Objectives (SLOs). Unfortunately, while SLOs may provide an excellent process for reflection and goal setting between the principal and teacher, they lack the psychometric properties of VAMs, which allow administrators to objectively rank a teacher in relation to other teachers in the district. As the Mathematica team observed, “SLOs are designed to vary not only by grade and subject but also across teachers within a grade and subject.” By contrast, academic research on VAM gave educators and policy makers the impression that a single measure of student growth could be used for teacher evaluation across grades and subjects. It was a misconception unfortunately promoted by many VAM researchers who may have been unaware that the technique could only be applied to a small portion of teachers.

There are several additional reasons that test scores are not useful for teacher evaluation.

The second reason is that VAMs or other measures of student growth don’t provide any indication as to how a teacher can improve. If the purpose of teacher evaluation is to inform personnel decisions such as terminations, salary increases, or bonuses, then, at least for reading and math teachers, VAM scores would be useful. But we are seeing a widespread orientation toward using evaluations to inform professional development. Other kinds of measures, most obviously classroom observations conducted by a mentor or administrator—combined with feedback and guidance—provide a more direct mapping to where the teacher needs to improve. The observer-teacher interactions within an established framework also provide an appropriate managerial discretion in translating the evaluation into personnel decisions. Observation frameworks not only break the observation into specific aspects of practice but provide a rubric for scoring in four or five defined levels. A teacher can view the training materials used to calibrate evaluators to see what the next level looks like. VAM scores are opaque in contrast.

Third, test scores are associated with a narrow range of classroom practice. My colleague, Val Lazarev, and I found an interesting result from a factor analysis of the data collected in the MET project. MET collected classroom videos from thousands of teachers, which were then coded using a number of frameworks. The students were tested in reading and/or math using an assessment that was more focused on problem-solving and constructive items than is found in the usual state test. Our analysis showed that a teacher’s VAM score is more closely associated with the framework elements related to classroom and behavior management (i.e., keeping order in the classroom) than the more refined aspects of dialog with students. Keeping the classroom under control is a fundamental ability associated with good teaching but does not completely encompass what evaluators are looking for. Test scores, as the benchmark measure for effective teaching, may not be capturing many important elements.

Fourth, achievement test scores (and associated VAMs) are calculated based on what teachers can accomplish with respect to improving test scores from the time students appear in their classes in the fall to when they take the standardized test in the spring. If you ask people about their most influential teacher, they talk about being inspired to take up a particular career or about being kept in school. These are results that are revealed in following years or even decades. A teacher who gets a student to start seeing math in a new way may not get immediate results on the spring test but may get the student to enroll in a more challenging course the next year. A teacher who makes a student feel at home in class may be an important part of the student not dropping out two years later. Whether or not teachers can cause these results is speculative. But the characteristics of warm, engaging, and inspiring teaching can be observed. We now have analytic tools and longitudinal datasets that can begin to reveal the association between being in a teacher’s class and the probability of a student graduating, getting into college, and pursuing a productive career. With records of systematic classroom observations, we may be able, in the future, to associate teaching practices with benchmarks that are more meaningful than the spring test score.

The policy-makers’ dream of an algorithm for translating test scores into teacher salary levels is a fallacy. Even the weaker provisions, such as the vague requirement that student growth must be an important element among multiple measures in teacher evaluations, have led to a profusion of methods of questionable utility for setting individual goals for teachers. But the insistence on using annual student achievement as the benchmark has led to more serious, perhaps unintended, consequences.

Teacher unions have had good reason to object to using test scores for evaluations. Teacher opposition to this misuse of test scores has reinforced a negative perception of tests as something that teachers oppose in general. The introduction of the new Common Core tests might have been welcomed by the teaching profession as a stronger alignment of the test with the widely shared belief about what is important for students to learn. But the change was opposed by the profession largely because it would be unfair to evaluate teachers on the basis of a test they had no experience preparing students for. Reducing the teaching profession’s opposition to testing may help reduce the clamor of the opt-out movement and keep the schools on the path of continuous improvement of student assessment.

We can return to recognizing that testing has value for teachers as formative assessment. And for the larger community it has value as assurance that schools and districts are maintaining standards, and most importantly, in considering the reauthorization of NCLB, not failing to educate subgroups of students who have the most need.

A final note. For purposes of program and policy evaluation, for understanding the elements of effective teaching, and for longitudinal tracking of the effect on students of school experiences, standardized testing is essential. Research on value-added modeling must continue and expand beyond tests to measure the effect of teachers on preparing students for “college and career”. Removing individual teacher evaluation from the equation will be a positive step toward having the data needed for evidence-based decisions.

An abbreviated version of this blog post can be found on Real Clear Education.


The New Look

We celebrated the launch of our newly designed website with champagne cocktails in the local park. The celebration was the perfect accompaniment to our website’s fresh new look, user-friendly navigation, and search functionality. The new site has several easy-to-use drop-down menus with updated reports and partner pages. The new design allows visitors to quickly find the information they seek. We hope that you will enjoy browsing our new site, and while you become acquainted, please use the links below to find things you might be searching for:

Reports and Papers

Research Capabilities

We look forward to keeping you updated on our latest projects on our new blog.

If you have feedback on the website, please contact our webmaster at


We Would Like to Introduce You to Our Newest Research Managers

The Empirical Research Team is pleased to announce the addition of two new team members. We welcome Erica Plut and Thanh Nguyen on board as our newest research managers!

Erica Plut, Research Manager

Erica has taken on management of the Ask A REL project for REL Midwest and is working with other Empirical staff in responding to stakeholder queries. Erica’s teaching experience and Stanford education have also been an asset to the Observation Engine™ team in their development of the “Content Suite”, which provides school system administrators with pre-coded videos and justification feedback aligned to the observation rubric they use in evaluating the individual need for teacher professional development.

Thanh Nguyen, Research Manager

Thanh is taking on the role of lead project manager for our evaluation of the i3 grant for WestEd’s Making Sense of Science project. She has already plunged into the substance, attending the Making Sense of Science Facilitation Academy at the WestEd offices in Oakland, where she had the opportunity to meet the other key people on the project. Thanh’s knowledge of education research and experience in project management make her the perfect fit on our team.


Meeting Long-time Friends and New Partners at #i3PD2015

Empirical sent five staff members (Denis Newman, Andrew Jaciw, Jenna Zacamy, Megan Toby, and Adam Schellinger) to the Investing in Innovation (i3) Project Directors’ meeting in Washington to support the five i3 evaluations we are currently conducting. The event was filled with formal and informal meetings with partners, members of the i3 technical assistance teams, and old friends. Projects we are currently evaluating are RAISE, Aspire Public Schools, iRAISE, Making Sense of Science, and CREATE. We are currently at work on proposals for the 2016 round of awards.