Blog posts and news stories

Feds Moving Toward a More Rational and Flexible Approach to Teacher Support and Evaluation

Congress is finally making progress on a bill to replace NCLB. Here’s an excerpt from a summary of the draft law.

Helps states support teachers: The bill provides resources to states and school districts to implement various activities to support teachers, principals, and other educators, including allowable uses of funds for high quality induction programs for new teachers, ongoing professional development opportunities for teachers, and programs to recruit new educators to the profession.

Ends federal mandates on evaluations, allows states to innovate: The bill allows, but does not require, states to develop and implement teacher evaluation systems. This bill eliminates the definition of a highly qualified teacher, which has proven onerous to states and school districts, and provides states with the opportunity to define this term.

This is very positive. It makes teacher evaluation no longer an Obama-imposed requirement but allows states that want to do it (and there are quite a few of those) to use federal funds to support it. It removes the irrational requirement that “student growth” be a major component of these systems. This will lower the reflexive resistance from unions because the purpose of evaluation can be more clearly associated with teacher support (for more on that argument, see the Real Clear Education piece). It will also encourage the use of observation and feedback from administrators and mentors. Removing the outmoded definition of “highly qualified teacher” opens up the possibility of wider use of research-based analyses of what is important to measure in effective teaching.

A summary is also provided by EdWeek. On a separate note, it says: “That new research and innovation program that some folks were describing as sort of a next generation ‘Investing in Innovation’ program made it into the bill. (Sens. Orrin Hatch, R-Utah, and Michael Bennet, D-Colo., are big fans, as is the administration.)”


Upcoming REL-SW Workshop Event

On November 19th, Erica Plut and Jenna Zacamy will join REL Southwest Alliance Liaison Haidee Williams in facilitating a workshop on Identifying Practices to Engage Native American Indian Families in Students’ Academic and Career Aspirations. The workshop is being offered to the Oklahoma Rural School Research Alliance members and their colleagues and will take place in Norman, Oklahoma. The goals of the workshop are:

  1. To increase alliance members’ knowledge and understanding of the research literature addressing promising practices to engage Native American Indian families in students’ academic and career aspirations
  2. To provide an opportunity to use the research literature to inform the refinement or development of family and community engagement programs or initiatives that are focused on students’ academic and career aspirations

You can find more information about this event on the IES website.


Unintended Consequences of Using Student Test Scores to Evaluate Teachers

There has been a powerful misconception driving policy in education. It’s a case where theory was inappropriately applied to practice. The misconception has had unintended consequences. It is helping to lead large numbers of parents to opt out of testing and could very well weaken the case in Congress for accountability as ESEA is reauthorized.

The idea that we can use student test scores as one of the measures in evaluating teachers came into vogue with Race to the Top. As a result of that and related federal policies, 38 states now include measures of student growth in teacher evaluations.

This was a conceptual advance over the NCLB definition of teacher quality in terms of preparation and experience. The focus on test scores was also a brilliant political move. The simple qualification for funding from Race to the Top—a linkage between teacher and student data—moved state legislatures to adopt policies calling for more rigorous teacher evaluations even without funding states to implement the policies. The simplicity of pointing to student achievement as the benchmark for evaluating teachers seemed incontrovertible.

It also had a scientific pedigree. Solid work had been accomplished by economists developing value-added modeling (VAM) to estimate a teacher’s contribution to student achievement. Hanushek et al.’s analysis is often cited as the basis for the now widely accepted view that teachers make the single largest contribution to student growth. The Bill and Melinda Gates Foundation invested heavily in its Measures of Effective Teaching (MET) project, which put the econometric calculation of teachers’ contribution to student achievement at the center of multiple measures.

The academic debates around VAM remain intense concerning the most productive statistical specification and evidence for causal inferences. Perhaps the most exciting area of research is in analyses of longitudinal datasets showing that students who have teachers with high VAM scores continue to benefit even into adulthood and career—not so much in their test scores as in their higher earnings, lower likelihood of having children as teenagers, and other results. With so much solid scientific work going on, what is the problem with applying theory to practice? While work on VAMs has provided important findings and productive research techniques, there are four important problems in applying these scientifically-based techniques to teacher evaluation.

First, and this is the thing that should have been obvious from the start, most teachers teach in grades or subjects where no standardized tests are given. If you’re conducting research, there is a wealth of data for math and reading in grades three through eight. However, if you’re a middle-school principal and there are standardized tests for only 20% of your teachers, you will have a problem using test scores for evaluation.

Nevertheless, federal policy required states, in order to receive a waiver from some of the requirements of NCLB, to institute teacher evaluation systems that use student growth as a major factor. To fill the gap in test scores, a few districts purchased or developed tests for every subject taught. A more widespread practice is the use of Student Learning Objectives (SLOs). Unfortunately, while SLOs may provide an excellent process for reflection and goal setting between the principal and teacher, they lack the psychometric properties of VAMs, which allow administrators to objectively rank a teacher in relation to other teachers in the district. As the Mathematica team observed, “SLOs are designed to vary not only by grade and subject but also across teachers within a grade and subject.” By contrast, academic research on VAM gave educators and policy makers the impression that a single measure of student growth could be used for teacher evaluation across grades and subjects. It was a misconception unfortunately promoted by many VAM researchers who may have been unaware that the technique could only be applied to a small portion of teachers.

There are several additional reasons that test scores are not useful for teacher evaluation.

The second reason is that VAMs or other measures of student growth don’t provide any indication as to how a teacher can improve. If the purpose of teacher evaluation is to inform personnel decisions such as terminations, salary increases, or bonuses, then, at least for reading and math teachers, VAM scores would be useful. But we are seeing a widespread orientation toward using evaluations to inform professional development. Other kinds of measures, most obviously classroom observations conducted by a mentor or administrator and combined with feedback and guidance, provide a more direct mapping to where the teacher needs to improve. The observer-teacher interactions within an established framework also allow for appropriate managerial discretion in translating the evaluation into personnel decisions. Observation frameworks not only break the observation into specific aspects of practice but also provide a rubric for scoring in four or five defined levels. A teacher can view the training materials used to calibrate evaluators to see what the next level looks like. VAM scores are opaque in contrast.

Third, test scores are associated with a narrow range of classroom practice. My colleague, Val Lazarev, and I found an interesting result from a factor analysis of the data collected in the MET project. MET collected classroom videos from thousands of teachers, which were then coded using a number of frameworks. The students were tested in reading and/or math using an assessment that was more focused on problem-solving and constructive items than the usual state test. Our analysis showed that a teacher’s VAM score is more closely associated with the framework elements related to classroom and behavior management (i.e., keeping order in the classroom) than with the more refined aspects of dialog with students. Keeping the classroom under control is a fundamental ability associated with good teaching but does not completely encompass what evaluators are looking for. Test scores, as the benchmark measure for effective teaching, may not be capturing many important elements.

Fourth, achievement test scores (and associated VAMs) are calculated based on what teachers can accomplish with respect to improving test scores from the time students appear in their classes in the fall to when they take the standardized test in the spring. If you ask people about their most influential teacher, they talk about being inspired to take up a particular career or about being kept in school. These are results that are revealed in following years or even decades. A teacher who gets a student to start seeing math in a new way may not get immediate results on the spring test but may get the student to enroll in a more challenging course the next year. A teacher who makes a student feel at home in class may be an important part of the student not dropping out two years later. Whether or not teachers can cause these results is speculative. But the characteristics of warm, engaging, and inspiring teaching can be observed. We now have analytic tools and longitudinal datasets that can begin to reveal the association between being in a teacher’s class and the probability of a student graduating, getting into college, and pursuing a productive career. With records of systematic classroom observations, we may be able, in the future, to associate teaching practices with benchmarks that are more meaningful than the spring test score.

The policy-makers’ dream of an algorithm for translating test scores into teacher salary levels is a fallacy. Even the weaker provisions, such as the vague requirement that student growth must be an important element among multiple measures in teacher evaluations, have led to a profusion of methods of questionable utility for setting individual goals for teachers. But the insistence on using annual student achievement as the benchmark has led to more serious, perhaps unintended, consequences.

Teacher unions have had good reason to object to using test scores for evaluations. Teacher opposition to this misuse of test scores has reinforced a negative perception of tests as something that teachers oppose in general. The introduction of the new Common Core tests might have been welcomed by the teaching profession as a stronger alignment of the test with the widely shared belief about what is important for students to learn. But the change was opposed by the profession largely because it would be unfair to evaluate teachers on the basis of a test they had no experience preparing students for. Reducing the teaching profession’s opposition to testing may help reduce the clamor of the opt-out movement and keep the schools on the path of continuous improvement of student assessment.

We can return to recognizing that testing has value for teachers as formative assessment. And for the larger community it has value as assurance that schools and districts are maintaining standards, and most importantly, in considering the reauthorization of NCLB, not failing to educate subgroups of students who have the most need.

A final note. For purposes of program and policy evaluation, for understanding the elements of effective teaching, and for longitudinal tracking of the effect on students of school experiences, standardized testing is essential. Research on value-added modeling must continue and expand beyond tests to measure the effect of teachers on preparing students for “college and career”. Removing individual teacher evaluation from the equation will be a positive step toward having the data needed for evidence-based decisions.

An abbreviated version of this blog post can be found on Real Clear Education.


The New Look

We celebrated the launch of our newly designed website with champagne cocktails in the local park. The celebration was the perfect accompaniment to our website’s fresh new look, user-friendly navigation, and search functionality. The new site has several easy-to-use drop-down menus with updated reports and partner pages. The new design allows visitors to quickly find the information they seek. We hope that you will enjoy browsing our new site, and while you become acquainted, please use the links below to find things you might be searching for:

Reports and Papers

Research Capabilities

We look forward to keeping you updated on our latest projects on our new blog.

If you have feedback on the website, please contact our webmaster at


We Would Like to Introduce You to Our Newest Research Managers

The Empirical Research Team is pleased to announce the addition of two new team members. We welcome Erica Plut and Thanh Nguyen on board as our newest research managers!

Erica Plut, Research Manager

Erica has taken on management of the Ask A REL project for REL Midwest and is working with other Empirical staff in responding to stakeholder queries. Erica’s teaching experience and Stanford education have also been an asset to the Observation Engine™ team in their development of the “Content Suite”, which provides school system administrators with pre-coded videos and justification feedback aligned to the observation rubric they use in evaluating the individual need for teacher professional development.

Thanh Nguyen, Research Manager

Thanh is taking on the role of lead project manager for our evaluation of the i3 grant for WestEd’s Making Sense of Science project. She has already plunged into the substance, attending the Making Sense of Science Facilitation Academy at the WestEd offices in Oakland, where she had the opportunity to meet the other key people on the project. Thanh’s knowledge of education research and experience in project management make her the perfect fit for our team.


Meeting Long-time Friends and New Partners at #i3PD2015

Empirical sent five staff members (Denis Newman, Andrew Jaciw, Jenna Zacamy, Megan Toby, and Adam Schellinger) to the Investing in Innovation (i3) Project Directors’ Meeting in Washington to support the five i3 evaluations we are currently conducting. The event was filled with formal and informal meetings with partners, members of the i3 technical assistance teams, and old friends. Projects we are currently evaluating are RAISE, Aspire Public Schools, iRAISE, Making Sense of Science, and CREATE. We are currently at work on proposals for the 2016 round of awards.


Work Has Started on Analysis of Texas Educator Evaluation (T-TESS) Pilot

Empirical Education, through its contract with the REL Southwest, has begun the data collection process for an analysis of the Texas Teacher Evaluation and Support System (T-TESS) pilot conducted by the Texas Education Agency. This is announced on the IES site. Empirical’s Senior Research Scientist Val Lazarev is leading the analysis, which will focus on the elements and components of the system to better understand what T-TESS is measuring and provide alternative approaches to forming summative or composite scores.


Empirical Education Visits Chicago

We had such a great time in windy Chicago last month for the annual meeting of the American Educational Research Association (AERA). All of the presentations we attended were thought-provoking, and our presentations also seemed to be well-received.

The highlight of our trip, as always, was our annual reception. This year was our first time entertaining friends in a presidential suite, and the one at the Fairmont did not disappoint. Thanks to everyone who came and enjoyed an HLM, our signature cocktail. (The Hendricks Lemontwist Martini, of course. What else would it stand for?)

Many of the pictures taken at our AERA reception can be found on Facebook, but here is a sneak peek of some of our favorites.


Conference Season 2015

Empirical researchers are traveling all over the country this conference season. Come meet our researchers as we discuss our work at the following events. If you plan to attend any of these, please get in touch so we can schedule a time to speak with you, or come by to see us at our presentations.


We are pleased to announce that we will make our fifth appearance at the 40th annual conference of the Association for Education Finance and Policy (AEFP). Join us in the afternoon on Friday, February 27th at the Marriott Wardman Park, Washington, DC, as Empirical’s Senior Research Scientist Valeriy Lazarev and CEO Denis Newman present on Methods of Teacher Evaluation in Concurrent Session 7. Denis will also be the acting discussant and chair on Friday morning at 8am in Session 4.07, titled Preparation/Certification and Evaluation of Leaders/Teachers.


Attendees of this spring’s Society for Research on Educational Effectiveness (SREE) Conference, held in Washington, DC, March 5-7, will have the opportunity to discuss instructional strategies and programs to improve mathematics with Empirical Education’s Chief Scientist Andrew P. Jaciw. The presentation, Assessing Impacts of Math in Focus, a ‘Singapore Math’ Program for American Schools, will take place on Friday, March 6 at 1pm in the Park Hyatt Hotel, Ballroom Level Gallery 3.


This year’s 70th annual conference for ASCD will take place in Houston, TX on March 21-23. We invite you to schedule a meeting with CEO Denis Newman while he’s there.


We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in Chicago, Illinois from April 16-20, 2015. Our presentations will cover research under the Division H (Research, Evaluation, and Assessment in Schools) Section 2 symposium: Program Evaluation in Schools.

  1. Formative Evaluation on the Process of Scaling Up Reading Apprenticeship Authors: Jenna Lynn Zacamy, Megan Toby, Andrew P. Jaciw, and Denis Newman
  2. The Evaluation of Internet-based Reading Apprenticeship Improving Science Education (iRAISE) Authors: Megan Toby, Jenna Lynn Zacamy, Andrew P. Jaciw, and Denis Newman

We look forward to seeing you at our sessions to discuss our research. As soon as we have the schedule for these presentations, we will post it here. As has become tradition, we plan to host yet another of our popular AERA receptions. Details about the reception will follow in the months to come.


IES Releases New Empirical Education Report on Educator Effectiveness

Our report just released by IES examines the statistical properties of Arizona’s new multiple-measure teacher evaluation model. The study used data from the pilot in 2012-13 to explore the relationships among the system’s component measures (teacher observations, student academic progress, and stakeholder surveys). It also investigated how well the model differentiated between higher and lower performing teachers. Findings suggest that the model’s observation measure may be improved through further evaluator training and calibration, and that a single aggregated composite score may not adequately represent independent aspects of teacher performance.

The study was carried out in partnership with the Arizona Department of Education as part of our work with the Regional Educational Laboratory (REL) West’s Educator Effectiveness Alliance, which includes Arizona, Utah, and Nevada Department of Education officials, as well as teacher union representatives, district administrators, and policymakers. While the analysis is specific to Arizona’s model, the study findings and methodology may be of interest to other state education agencies that are developing or implementing new multiple-measure evaluation systems. We have continued this work with additional analyses for alliance members and plan to provide additional capacity building during 2015.