blog posts and news stories

State Reports Show Almost All Teachers Are Effective or Highly So. Is This Good News?

The New York Times recently picked up a story, originally reported in Education Week two months ago, that school systems using formal methods for classroom observation as part of their educator evaluations are giving all but a very small percent of teachers high ratings—a phenomenon commonly known as the “widget effect.” The Times quotes Russ Whitehurst as suggesting that “It would be an unusual profession that at least 5 percent are not deemed ineffective.”

Responding to the story in her blog, Diane Ravitch calls it “unintentionally hilarious,” portraying the so-called reformers as upset that their own expensive evaluation methods are finding that most teachers are good at what they do. In closing, she asks, “Where did all those ineffective teachers go?”

We’re a research company working actively on teacher evaluation, so we’re interested in these kinds of questions. Should state-of-the-art observation protocols have found more teachers in the “needs improvement” category or at least 5% labeled “ineffective”? We present here an informal analysis meant to get an approximate answer, but based on data that was collected in a very rigorous manner.

As one of the partners in the Gates Foundation’s Measures of Effective Teaching (MET) project, Empirical Education has access to a large dataset available for this examination, including videotaped lessons for almost 2,000 teachers coded according to a number of popular observational frameworks. Since the MET raters were trained intensively using methods approved by the protocol developers and had no acquaintance or supervisory relationship with the teachers in the videos, there is reason to think that the results show the kind of distribution intended by the developers of the observation methods. We can then compare the results in this controlled environment to the results referred to in the EdWeek and Times articles, which were based on reporting by state agencies. We used a simple (but reasonable) way of calculating the distribution of teachers in the MET data according to the categories in one popular protocol and compared it to the results reported by one of the states for a district known to have trained principals and other observers in the same protocol.

We show the results here. The light bars show the distribution of the ratings in the MET data. We can see that a small percentage are rated “highly effective” and an equally small percentage “unsatisfactory.” So although the number doesn’t come up to the percent suggested by Russ Whitehurst, this well-developed method finds only 2% of a large sample of teachers to be in the bottom category. About 63% are considered “effective,” while a third are given a “needs improvement” rating.
The dark bars are the ratings given by the school district using the same protocol. This shows a distribution typical of what EdWeek and the Times reported, where 97% are rated as “highly effective” or “effective.” It is interesting that the school district and MET research both found a very small percentage of unsatisfactory teachers.

Where we find a big difference is in the fact that the research program deemed only a small number of teachers to be exceptional while the school system used that category much more liberally. The other major difference is in the “needs improvement” category. When the observational protocol is used as designed, a solid number of teachers are viewed as doing OK but potentially doing much better. Both in research and in practice, the observational protocol divides most teachers between two categories. In the research setting, the distinction is between teachers who are effective and those who need improvement. In practice, users of the same protocol distinguish between effective and highly effective teachers. Both identify a small percent as unsatisfactory.

Our analysis suggests two problems with the use of the protocol in practice: first, the process does not provide feedback to teachers who are developing their skills, and, second, it does not distinguish between very good teachers and truly exceptional ones. We can imagine all sorts of practical pressures that, for the evaluators (principals, coaches, and other administrators), decrease the value of identifying teachers who are less than fully effective and can benefit from developing specific skills. For example, unless all the evaluators in a district simultaneously agree to implement more stringent evaluations, teachers in the schools where such evaluations are implemented will be disadvantaged. It will also help to have consistent training and calibration for the evaluators, as well as accountability, which can be supported with a fairly straightforward examination of the distribution of ratings.
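As a rough illustration of what such an examination of the distribution of ratings might look like, here is a minimal Python sketch. The evaluator names, the ratings, and the 90% threshold are all made up for the example; they are assumptions and not part of our analysis or of any protocol.

```python
from collections import Counter

# Rating categories in a fixed order (labels assumed for illustration).
CATEGORIES = ["unsatisfactory", "needs improvement", "effective", "highly effective"]


def rating_distribution(ratings):
    """Return the share of ratings falling in each category."""
    counts = Counter(ratings)
    total = len(ratings)
    return {category: counts.get(category, 0) / total for category in CATEGORIES}


def flag_lenient(ratings_by_evaluator, max_top_share=0.9):
    """List evaluators whose combined 'effective' and 'highly effective'
    share exceeds the threshold (the threshold is a hypothetical choice)."""
    flagged = []
    for evaluator, ratings in ratings_by_evaluator.items():
        dist = rating_distribution(ratings)
        top_share = dist["effective"] + dist["highly effective"]
        if top_share > max_top_share:
            flagged.append(evaluator)
    return sorted(flagged)


# Made-up ratings for two hypothetical evaluators.
example = {
    "evaluator_a": ["effective"] * 6 + ["needs improvement"] * 3 + ["highly effective"],
    "evaluator_b": ["highly effective"] * 5 + ["effective"] * 5,
}
flagged = flag_lenient(example)
```

A flag of this kind would only be a prompt for a closer look, since a school may in fact have an unusually strong staff.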

Although this was a very informal analysis with a number of areas where we approximated results, we think we can conclude that Russ Whitehurst probably overstated the estimate of ineffective teachers but Diane Ravitch probably understated the estimate of teachers who could use some help and guidance in getting better at what they do.

Postscript. Because we are researchers and not committed to the validity of the observational methods, we need to state that we don’t know the extent to which the teachers labeled ineffective are generally less capable of raising student achievement. But researchers are notorious for ending all our reports with “more research is needed!”

2013-04-20

Study Shows a “Singapore Math” Curriculum Can Improve Student Problem Solving Skills

A study of HMH Math in Focus (MIF) released today by research firm Empirical Education Inc. demonstrates a positive impact of the curriculum on Clark County School District elementary students’ math problem solving skills. The 2011-2012 study was contracted by the publisher, which left the design, conduct, and reporting to Empirical. MIF provides elementary math instruction based on the pedagogical approach used in Singapore. The MIF approach to instruction is designed to support conceptual understanding, and is said to be closely aligned with the Common Core State Standards (CCSS), which focus more on in-depth learning than previous math standards.

Empirical found an increase in math problem solving among students taught with HMH Math in Focus compared to their peers. The Clark County School District teachers also reported an increase in their students’ conceptual understanding, as well as an increase in student confidence and engagement while explaining and solving math problems. The study addressed the difference between the CCSS-oriented MIF and the existing Nevada math standards and content. While MIF students performed comparatively better on complex problem solving skills, researchers found that students in the MIF group performed no better than the students in the control group on the measure of math procedures and computation skills. There was also no significant difference between the groups on the state CRT assessment, which has not fully shifted over to the CCSS.

The research used a group randomized control trial to examine the performance of students in grades 3-5 during the 2011-2012 school year. Each grade-level team was randomly assigned to either the treatment group that used MIF or the control group that used the conventional math curriculum. Researchers used three different assessments to capture math achievement contrasting procedural and problem solving skills. Additionally, the research design employed teacher survey data to conduct mediator analyses (correlations between percentage of math standards covered and student math achievement) and assess fidelity of classroom implementation.

You can download the report and research summary from the study using the links below.
Math in Focus research report
Math in Focus research summary

2013-04-01

Empirical Starts on a 3rd Investing in Innovation (i3) Evaluation

This week was the kickoff meeting in Oakland, CA for a multi-year evaluation of WestEd’s iRAISE project, a grant to develop an online training system for their Reading Apprenticeship framework. iRAISE stands for Internet-based Reading Apprenticeship Improving Science Education. Being developed by WestEd’s Strategic Literacy Initiative (SLI), a prominent R&D group in this domain, iRAISE will provide a 65-hour online version of their conventional face-to-face professional development for high school science teachers. We are also contracted for the evaluation of the validation-level i3 grant to WestEd for a scaling up of Reading Apprenticeship, a project that received the third highest score in that year’s i3 competition. Additionally, Empirical is conducting the evaluation of Aspire Public Schools’ 2011 development grant. In this case we are evaluating their teacher effectiveness technology tools.

Further information on our capabilities working with i3 grants is located here.

2013-03-22

Importance is Important for Rules of Evidence Proposed for ED Grant Programs

The U.S. Department of Education recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program where the proposal must show there’s some evidence that the proposed innovation has a chance of working and scaling and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. And each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules definitely had some weak points that were pointed out in the public comments, which are available online. We hope ED heeds these suggestions.

Mark Schneiderman, representing the Software and Information Industry Association (SIIA), recommends that outcomes used in effectiveness studies should not be limited to achievement scores.

SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.

He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.

SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.

This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment the size of the impact for any other context cannot be anticipated. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.

Jon Baron, writing for the Coalition for Evidence-Based Policy, expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that

“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”

He cites examples where researchers have reported statistically significant results, which were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study—policy makers must consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.

Addressing the substance and importance of the results gets us appropriately into questions of external validity, and leads us to questions about subgroup impact, where, for example, an innovation has a positive impact “on average” and works well for high scoring students but provides no value for low scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.
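To make the point about subgroup impact concrete, here is a small Python sketch with entirely made-up gain scores. The subgroup labels and numbers are hypothetical and chosen only to show how a positive average impact can coexist with a widening gap.

```python
from statistics import mean


def impact(treated, control):
    """Difference between treatment and control group means."""
    return mean(treated) - mean(control)


# Hypothetical gain scores for two subgroups (not data from any study).
scores = {
    "high_scoring": {"treated": [12, 14, 13, 15], "control": [8, 9, 7, 8]},
    "low_scoring": {"treated": [5, 4, 6, 5], "control": [5, 5, 4, 6]},
}

# Impact averaged over the whole sample.
all_treated = scores["high_scoring"]["treated"] + scores["low_scoring"]["treated"]
all_control = scores["high_scoring"]["control"] + scores["low_scoring"]["control"]
average_impact = impact(all_treated, all_control)

# Impact within each subgroup.
impact_by_group = {
    group: impact(data["treated"], data["control"]) for group, data in scores.items()
}

# How much more the high scoring group gained from the innovation
# than the low scoring group did.
gap_change = impact_by_group["high_scoring"] - impact_by_group["low_scoring"]
```

In this invented case the average impact is 2.75 points, but all of it comes from the high scoring students, so the gap between the two groups grows by 5.5 points.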

2013-03-17

We Turned 10!

Happy birthday to us, happy birthday to us, happy birthday to Empirical Education, happy birthday to us!

This month we turn 10 years old! We can’t think of a better way to celebrate than with all of our friends at a birthday party at AERA next month.

If you aren’t able to attend our birthday party, we’ll also be presenting at SREE this week and at AERA next month.

Research Topics will include:

We look forward to seeing you at our sessions to discuss our research.

Pictures from the party are on our Facebook page, but here’s a sneak peek.

2013-03-05

Does 1 teacher = 1 number? Some Questions About the Research on Composite Measures of Teacher Effectiveness

We are all familiar with approaches to combining student growth metrics and other measures to generate a single measure that can be used to rate teachers for the purpose of personnel decisions. For example, as an alternative to using seniority as the basis for reducing the workforce, a school system may want to base such decisions, at least in part, on a ranking based on a number of measures of teacher effectiveness. One of the reports released January 8 by the Measures of Effective Teaching (MET) project addressed approaches to creating a composite (i.e., a single number that averages various aspects of teacher performance) from multiple measures such as value-added modeling (VAM) scores, student surveys, and classroom observations. Working with the thousands of data points in the MET longitudinal database, the researchers were able to try out multiple statistical approaches to combining measures. The important recommendation from this research for practitioners is that, while there is no single best way to weight the various measures that are combined in the composite, balancing the weights more evenly tends to increase reliability.
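For readers who want to see the mechanics, the following Python sketch shows one simple way a weighted composite could be formed: standardize each measure, then take a weighted sum. This is not the MET researchers' method. The scores, measure names, and weights are invented, and the sketch only shows how the choice of weights can change a ranking; it does not estimate reliability.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize scores to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(value - center) / spread for value in values]


def composite(measures, weights):
    """Weighted sum of standardized measures, one value per teacher.

    measures: dict of measure name -> list of scores (same teacher order).
    weights:  dict of measure name -> weight (assumed to sum to 1).
    """
    standardized = {name: zscores(values) for name, values in measures.items()}
    n_teachers = len(next(iter(measures.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in measures)
        for i in range(n_teachers)
    ]


# Invented scores for four hypothetical teachers.
measures = {
    "vam": [0.10, -0.05, 0.20, -0.15],
    "student_survey": [3.8, 4.1, 3.5, 3.9],
    "observation": [2.9, 3.2, 2.7, 3.0],
}

even = composite(measures, {"vam": 1 / 3, "student_survey": 1 / 3, "observation": 1 / 3})
vam_heavy = composite(measures, {"vam": 0.8, "student_survey": 0.1, "observation": 0.1})

# Index of the top-ranked teacher under each weighting.
top_even = even.index(max(even))
top_vam_heavy = vam_heavy.index(max(vam_heavy))
```

With these made-up numbers, the top-ranked teacher differs between the two weightings, which is one reason the choice of weights deserves attention.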

While acknowledging the value of these analyses, we want to take a step back in this commentary. Here we ask whether agencies may sometimes be jumping to the conclusion that a composite is necessary when the individual measures (and even the components of these measures) may have greater utility than the composite for many purposes.

The basic premise behind creating a composite measure is the idea that there is an underlying characteristic that the composite can more or less accurately reflect. The criterion for a good composite is the extent to which the result accurately identifies a stable characteristic of the teacher’s effectiveness.

A problem with this basic premise is that in focusing on the common factor, the aspects of each measure that are unrelated to the common factor get left out—treated as noise in the statistical equation. But, what if observations and student surveys measure things that are unrelated to what the teacher’s students are able to achieve in a single year under her tutelage (the basis for a VAM score)? What if there are distinct domains of teacher expertise that have little relation to VAM scores? By definition, the multifaceted nature of teaching gets reduced to a single value in the composite.

This single value does have a use in decisions that require an unequivocal ranking of teachers, such as some personnel decisions. For most purposes, however, a multifaceted set of measures would be more useful. The single measure has little value for directing professional development, whereas the detailed output of the observation protocols is designed for just that. Consider a principal deciding which teachers to assign as mentors, or a district administrator deciding which teachers to move toward a principalship. Might it be useful, in such cases, to have several characteristics to represent different dimensions of abilities relevant to success in the particular roles?

Instead of collapsing the multitude of data points from achievement, surveys, and observations, consider an approach that makes maximum use of the data points to identify several distinct characteristics. In the usual method for constructing a composite (and in the MET research), the results for each measure (e.g., the survey or observation protocol) are first collapsed into a single number, and then these values are combined into the composite. This approach already obscures a large amount of information. The Tripod student survey provides scores on the seven Cs; an observation framework may have a dozen characteristics; and even VAM scores, usually thought of as a summary number, can be broken down (with some statistical limitations) into success with low-scoring vs. with high-scoring students (or any other demographic category of interest). Analyzing dozens of these data points for each teacher can potentially identify several distinct facets of a teacher’s overall ability. Not all facets will be strongly correlated with VAM scores but may be related to the teacher’s ability to inspire students in subsequent years to take more challenging courses, stay in school, and engage parents in ways that show up years later.

Creating a single composite measure of teaching has value for a range of administrative decisions. However, the mass of teacher data now being collected is only beginning to be tapped for improving teaching and developing schools as learning organizations.

2013-02-14

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see our earlier story on Validation Engine). But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.
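A first pass at the correlational check described above could look something like the Python sketch below. It is not how Validation Engine works. The rubric component names and all scores are invented, and a real analysis would need far more teachers and attention to measurement error.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Invented rubric component scores for five hypothetical teachers.
rubric_components = {
    "questioning": [2.0, 3.0, 2.5, 3.5, 4.0],
    "classroom_management": [3.0, 3.0, 2.0, 3.5, 2.5],
}

# Invented VAM scores for the same five teachers, in the same order.
vam_scores = [-0.10, 0.05, 0.00, 0.15, 0.20]

# Correlation of each rubric component with VAM.
correlations = {
    component: pearson(scores, vam_scores)
    for component, scores in rubric_components.items()
}
```

Components with weak correlations are not necessarily unimportant; they may capture aspects of teaching that a single year of test score growth does not reflect.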

2012-10-31

Oklahoma Implements Empirical’s Observation Engine for Certification of Classroom Observers

Tulsa Public Schools, the Cooperative Council for Oklahoma Administration, and Empirical Education Inc. just announced the launch of Observation Engine to implement the Teacher and Leader Effectiveness program in the state of Oklahoma. Tulsa Public Schools has purchased Empirical Education’s Observation Engine, an online certification and calibration tool for measuring the reliability of administrators assigned to conduct classroom observations. Tulsa Public Schools developed the Tulsa Model for Observation and Evaluation, a framework for ensuring teaching effectiveness performance, as well as best practices for creating an environment for successful learning and student achievement. Nearly 500 school districts in the state are piloting the Tulsa Model evaluation system this year.

In order to support the dissemination of the Tulsa Model, the Cooperative Council for Oklahoma Administration (CCOSA) is training and administering calibration tests throughout the state to assess and certify the individuals who evaluate the state’s teachers. The Tulsa Model is embedded in Observation Engine to deliver an efficient online system for state-wide use by Oklahoma certified classroom observers. Observation Engine is allowing CCOSA to test approximately 2,000 observers over a span of two weeks.

Observation Engine was developed as part of the Bill and Melinda Gates Foundation’s Measures of Effective Teaching project, in which Empirical Education has participated as a research partner conducting R&D on validity and reliability of observational measures. The web-based software was built by Empirical Education, which hosts and supports it for school systems nationwide.

For more details on these events, see the press announcement and our case study.

2012-10-10

Empirical Releases Final Report on HMH Fuse™ iPad App

Today Empirical and Houghton Mifflin Harcourt made the following announcement. You can download the report and research summary from the study using the links below.
Fuse research report
Fuse research summary

Study Shows HMH Fuse™ iPad® App Can Dramatically Improve Student Achievement

Strong implementation in Riverside Unified School District associated with nine-point increase in percentile standing

BOSTON – April 10, 2012 – A study of HMH Fuse: Algebra 1 app released today by research firm Empirical Education Inc. identifies implementation as a key factor in the success of mobile technology. The 2010–2011 study was a pilot of a new educational app from global education leader Houghton Mifflin Harcourt (HMH) that re-imagines the conventional textbook to fully deploy interactive features of the mobile device. The HMH Fuse platform encourages the use of personalized lesson plans by combining direct instruction, ongoing support, assessment and intervention in one easy-to-use suite of tools.

Empirical found that the iPad-using students in the four participating districts, Long Beach, Fresno, San Francisco, and Riverside Unified School District (Riverside Unified), performed on average as well as their peers using the traditional textbook. However, after examining its own results, Riverside Unified found an increase in test scores among students taught with HMH Fuse compared to their peers. Empirical corroborated these results, finding a statistically significant impact equivalent to a nine-point percentile increase. The Riverside Unified teachers also reported substantially greater usage of the HMH Fuse app both in teaching and by the students in class.

“Education technology does not operate in a vacuum, and the research findings reinforce that with a supportive school culture and strategic implementation, technology can have a significant impact on student achievement,” said Linda Zecher, President and CEO of HMH. “We’re encouraged by the results of the study and the potential of mobile learning to accelerate student achievement and deepen understanding in difficult to teach subjects like algebra.”

Across all districts, the study found a positive effect on student attitudes toward math, and those students with positive attitudes toward math achieved higher scores on the California Standards Test.

The research design was a “gold standard” randomized control trial that examined the performance of eighth-grade students during the 2010-2011 school year. Each teacher’s classes were randomly assigned to either the treatment group that used the HMH Fuse app or the control group that used the conventional print format of the same content.

“The rapid pace of mobile technology’s introduction into K-12 education leaves many educators with important questions about its efficacy especially given their own resources and experience,” said Denis Newman, CEO of Empirical Education. “The results from Riverside highlight the importance of future research on mobile technologies that account for differences in teacher experience and implementation.”

To access the full research report, go to www.empiricaleducation.com. A white paper detailing the implementation and impact of HMH Fuse in Riverside is available on the HMH website.

2012-04-10

The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for this. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual districts, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then some doubt may be cast on the completeness and value of the confirmatory finding. We may not be certain of a moderator effect, for example, but once it comes to light, the average impact can also be considered incomplete or misleading for practical purposes. If it is necessary to conduct an additional experiment to verify a differential subgroup impact, the same experiment may verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizability of the results from even the most stringently designed experiment mean that the results should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work are the critical information to have. For a school system decision maker, the most useful pieces of information are insight into subpopulations that appear to benefit and conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies that have confirmatory results in contexts similar to or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment. The practitioner also has to come to a hypothesis about how things work as the basis for action.

2012-04-01