blog posts and news stories

Empirical to Evaluate Two More Winning i3 Projects

U.S. Department of Education has announced the highest-rated applicants for the 2014 Investing in Innovation (i3) competition. Of the 434 submissions received by ED, we are excited that both of the proposals for which we developed evaluation plans were among the 26 winners. Both of these proposals were THE highest rated in their respective categories!

In one, we’ll be partnering with WestEd to evaluate the Making Sense of Science and Literacy program. Written as a validation proposal, this 5-year project will aim to strengthen teachers’ content knowledge, transform classroom practices, and boost student achievement.

The other highest-rated application was a development proposal submitted by the Atlanta Neighborhood Charter Schools. In this 5-year project, we will assess its 3-year residency model on the effectiveness of early career teachers.

Both projects were bid under the “priority” for teacher effectiveness. We have a long standing partnership with WestEd on i3 evaluations and Regional Lab projects. This is our first project with Atlanta Neighborhood Charter Schools, and it builds on our educator effectiveness work and our ongoing partnerships with charter schools, including our evaluation of an i3 Development effort by Aspire Public Schools.

For more information on our evaluation services and our work on i3 projects, please visit our i3 page and/or contact Robin Means.


U.S. Department of Education Could Expand its Concept of Student Growth

The continuing debate about the use of student test scores as a part of teacher evaluation misses an essential point. A teacher’s influence on a student’s achievement does not end in spring when the student takes the state test (or is evaluated using any of the Student Learning Objectives methods). An inspiring teacher, or one that makes a student feel recognized, or one that digs a bit deeper into the subject matter, may be part of the reason that the student later graduates high school, gets into college, or pursues a STEM career. These are “student achievements,” but they are ones that show up years after a teacher had the student in her class. As a teacher is getting students to grapple with a new concept, the students may not demonstrate improvements on standardized tests that year. But the “value-added” by the teacher may show up in later years.

States and districts implementing educator evaluations as part of their NCLB waivers are very aware of the requirement that they must “use multiple valid measures in determining performance levels, including as a significant factor data on student growth …” Student growth is defined as change between points in time in achievement on assessments. Student growth defined in this way obscures a teacher’s contribution to a student’s later school career.

As a practical matter, it may seem obvious that for this year’s evaluation, we can’t use something that happens next year. But recent analyses of longitudinal data, reviewed in an excellent piece by Raudenbush show that it is possible to identify predictors of later student achievement associated with individual teacher practices and effectiveness. The widespread implementation of multiple-measure teacher evaluations is starting to accumulate just the longitudinal datasets needed to do these predictive analyses. On the basis of these analyses we may be able to validate many of the facets of teaching that we have found, in analyses of the MET data, to be unrelated to student growth as defined in the waiver requirements.

Insofar as we can identify, through classroom observations and surveys, practices and dispositions that are predictive of later student achievement such as college going, then we have validated those practices. Ultimately, we may be able to substitute classroom observations and surveys of students, peers, and parents for value-added modeling based on state tests and other ad hoc measures of student growth. We are not yet at that point, but the first step will be to recognize that a teacher’s influence on a student’s growth extends beyond the year she has the student in the class.


Empirical Education Helps North Carolina to Train and Calibrate School Leaders in the North Carolina Educator Effectiveness System

Empirical Education, working with its partner BloomBoard, is providing calibration and training services for school administrators across the state of North Carolina. The use of Observation Engine began in June with a pilot of the integrated solution, and once fully deployed, will be available to all 115 districts in the state, reaching more than 6,000 school leaders and potentially 120,000 teachers in the process.

The partnership with BloomBoard gives users an easy-to-use, integrated platform and gives North Carolina Department of Public Instruction (NCDPI) a comprehensive online training and calibration solution for school administrators who will be evaluating teachers as part of the North Carolina Educator Evaluation System (NCEES). The platform will combine Empirical’s state-of-the-art observer training and calibration tool, Observation Engine, with BloomBoard’s Professional Development Marketplace.

NCDPI Director of Education Effectiveness, Lynne Johnson, is excited about the potential for the initiative. “The BloomBoard-Empirical partnership is an innovative new approach that will help change the way our state personalizes the training, professional development, and calibration of our educators,” says Johnson. “We look forward to working with partners that continue to change the future of U.S. education.”

Read the press release.


Understanding Logic Models Workshop Series

On July 17, Empirical Education facilitated the first of two workshops for practitioners in New Mexico on the development of program logic models, one of the first steps in developing a research agenda. The workshop, entitled “Identifying Essential Logic Model Components, Definitions, and Formats”, introduced the general concepts, purposes, and uses of program logic models to members of the Regional Education Lab (REL) Southwest’s New Mexico Achievement Gap Research Alliance. Throughout the workshop, participants collaborated with facilitators to build a logic model for a program or policy that participants are working on or that is of interest.

Empirical Education is part of the REL Southwest team, which assists Arkansas, Louisiana, New Mexico, Oklahoma, and Texas in using data and research evidence to address high-priority regional needs, including charter school effectiveness, early childhood education, Hispanic achievement in STEM, rural school performance, and closing the achievement gap, through six research alliances. The logic model workshops aim to strengthen the technical capacity of New Mexico Achievement Gap Research Alliance members to understand and visually represent their programs’ theories of change, identify key program components and outcomes, and use logic models to develop research questions. Both workshops are being held in Albuquerque, New Mexico.


Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research in Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and over-report large, but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge to conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions, from bias in results that John Ioannidis considered. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind—an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard, it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: what are the attributes of participants and settings across which we expect effects to vary and why? Under which conditions and how do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.


Factor Analysis Shows Facets of Teaching

The Empirical team has illustrated quantitatively what a lot of people have suspected. Basic classroom management, keeping things moving along, and the sense that the teacher is in control are most closely associated with achievement gains. We used teacher evaluation data collected by the Measures of Effective Teaching project to develop a three-factor model and found that only one factor was associated with VAM scores. Two other factors—one associated with constructivist pedagogy and the other with positive attitudes—were unrelated to short-term student outcomes. Val Lazarev and Denis Newman presented this work at the Association for Education Finance and Policy Annual Conference on March 13, 2014. And on May 7, Denis Newman and Kristen Koue conducted a workshop on the topic at the CCSSO’s SCEE Summit. The workshop emphasized the way that factors not directly associated with spring test scores can be very important in personnel decisions. The validation of these other factors may require connections to student achievements such as staying in school, getting into college, or pursuing a STEM career in years after the teacher’s direct contact.


Empirical Education Presents Initial Results from i3 Validation Grant Evaluation

Our director of research operations, Jenna Zacamy, joined Cheri Fancsali from IMPAQ International and Cyndy Greenleaf from the Strategic Literacy Initiative (SLI) at WestEd at the Literacy Research Association (LRA) conference in Dallas, TX on December 4. Together, they conducted a symposium, which was the first formal presentation of findings from the Investing in Innovation (i3) Validation grant, Reading Apprenticeship Improving Secondary Education (RAISE). WestEd won the grant in 2010 with Empirical Education and IMPAQ serving as the evaluators. There are two ongoing evaluations: the first includes a randomized control trial (RCT) of over 40 schools in California and Pennsylvania investigating the impact of Reading Apprenticeship on teacher instructional practices and student achievement; the second is a formative evaluation spanning four states and 150+ schools investigating how the school systems build capacity to implement and disseminate Reading Apprenticeship and sustain these efforts. The symposium’s discussant, P. David Pearson (UC Berkeley), provided praise of the design and effort of both studies stating that he has “never seen such thoughtful and extensive evaluations.” Preliminary findings from the RCT show that Reading Apprenticeship teachers provide students more opportunities to practice metacognitive strategies and foster and support more student collaboration opportunities. Findings from the second year of the formative evaluation suggest high levels of buy-in and commitment from both school administrators and teachers, but also identify competing initiatives and priorities as a primary challenge to sustainability. Initial findings of our five-year, multi-state study of RAISE are promising, but reflect the real-world complexity of scaling up and evaluating literacy initiatives across several contexts. Final results from both studies will be available in 2015.

View the information presented at LRA here and here.


Empirical Presents about Aspire Public School’s t3 System at AEA 2013

Empirical Education presented at the annual conference of the American Evaluation Association (AEA) in Washington, DC. Our newest research manager, Kristen Koue, along with our chief scientist, Andrew Jaciw reflected on striking the right balance between conducting a rigorous randomized control trial that meets i3 grant parameters, while also conducting an implementation evaluation that provides useful formative feedback to the Aspire population.


State Reports Show Almost All Teachers Are Effective or Highly So. Is This Good News?

The New York Times recently picked up a story, originally reported in Education Week two months ago, that school systems using formal methods for classroom observation as part of their educator evaluations are giving all but a very small percent of teachers high ratings—a phenomenon commonly known as the “widget effect.” The Times quotes Russ Whitehurst as suggesting that “It would be an unusual profession that at least 5 percent are not deemed ineffective.”

Responding to the story in her blog, Diane Ravitch calls it “unintentionally hilarious,” portraying the so-called reformers as upset that their own expensive evaluation methods are finding that most teachers are good at what they do. In closing, she asks, “Where did all those ineffective teachers go?”

We’re a research company working actively on teacher evaluation, so we’re interested in these kinds of questions. Should state-of-the-art observation protocols have found more teachers in the “needs improvement” category or at least 5% labeled “ineffective”? We present here an informal analysis meant to get an approximate answer, but based on data that was collected in a very rigorous manner. As one of the partners in the Gates Foundation’s Measures of Effective Teaching (MET) project, Empirical Education has access to a large dataset available for this examination, including videotaped lessons for almost 2,000 teachers coded according to a number of popular observational frameworks. Since the MET raters were trained intensively using methods approved by the protocol developers and had no acquaintance or supervisory relationship with the teachers in the videos, there is reason to think that the results show the kind of distribution intended by the developers of the observation methods. We can then compare the results in this controlled environment to the results referred to in the EdWeek and Times articles, which were based on reporting by state agencies. We used a simple (but reasonable) way of calculating the distribution of teachers in the MET data according to the categories in one popular protocol and compared it to the results reported by one of the states for a district known to have trained principals and other observers in the same protocol. We show the results here. The light bars show the distribution of the ratings in the MET data. We can see that a small percentage are rated “highly effective” and an equally small percentage “unsatisfactory.” So although the number doesn’t come up to the percent suggested by Russ Whitehurst, this well-developed method finds only 2% of a large sample of teachers to be in the bottom category. About 63% are considered “effective”, while a third are given a “needs improvement” rating. The dark bars are the ratings given by the school district using the same protocol. This shows a distribution typical of what EdWeek and the Times reported, where 97% are rated as “highly effective” or “effective.” It is interesting that the school district and MET research both found a very small percentage of unsatisfactory teachers.

Where we find a big difference is in the fact that the research program deemed only a small number of teachers to be exceptional while the school system used that category much more liberally. The other major difference is in the “needs improvement” category. When the observational protocol is used as designed, a solid number of teachers are viewed as doing OK but potentially doing much better. Both in research and in practice, the observational protocol divides most teachers between two categories. In the research setting, the distinction is between teachers who are effective and those who need improvement. In practice, users of the same protocol distinguish between effective and highly effective teachers. Both identify a small percent as unsatisfactory.

Our analysis suggests two problems with the use of the protocol in practice: first, the process does not provide feedback to teachers who are developing their skills, and, second, it does not distinguish between very good teachers and truly exceptional ones. We can imagine all sorts of practical pressures that, for the evaluators (principals, coaches and other administrators) decrease the value of identifying teachers who are less than fully effective and can benefit from developing specific skills. For example, unless all the evaluators in a district simultaneously agree to implement more stringent evaluations, then teachers in the schools where such evaluations are implemented will be disadvantaged. It will help to also have consistent training and calibration for the evaluators as well as accountability, which can be done with a fairly straightforward examination of the distribution of ratings.

Although this was a very informal analysis with a number of areas where we approximated results, we think we can conclude that Russ Whitehurst probably overstated the estimate of ineffective teachers but Diane Ravitch probably understated the estimate of teachers who could use some help and guidance in getting better at what they do.

Postscript. Because we are researchers and not committed to the validity of the observational methods, we need to state that we don’t know the extent to which the teachers labeled ineffective are generally less capable of raising student achievement. But researchers are notorious for ending all our reports with “more research is needed!”


Study Shows a “Singapore Math” Curriculum Can Improve Student Problem Solving Skills

A study of HMH Math in Focus (MIF) released today by research firm Empirical Education Inc. demonstrates a positive impact of the curriculum on Clark County School District elementary students’ math problem solving skills. The 2011-2012 study was contracted by the publisher, which left the design, conduct, and reporting to Empirical. MIF provides elementary math instruction based on the pedagogical approach used in Singapore. The MIF approach to instruction is designed to support conceptual understanding, and is said to be closely aligned with the Common Core State Standards (CCSS), which focuses more on in-depth learning than previous math standards.

Empirical found an increase in math problem solving among students taught with HMH Math in Focus compared to their peers. The Clark County School District teachers also reported an increase in their students’ conceptual understanding, as well as an increase in student confidence and engagement while explaining and solving math problems. The study addressed the difference between the CCSS-oriented MIF and the existing Nevada math standards and content. While MIF students performed comparatively better on complex problem solving skills, researchers found that students in the MIF group performed no better than the students in the control group on the measure of math procedures and computation skills. There was also no significant difference between the groups on the state CRT assessment, which has not fully shifted over to the CCSS.

The research used a group randomized control trial to examine the performance of students in grades 3-5 during the 2011-2012 school year. Each grade-level team was randomly assigned to either the treatment group that used MIF or the control group that used the conventional math curriculum. Researchers used three different assessments to capture math achievement contrasting procedural and problem solving skills. Additionally, the research design employed teacher survey data to conduct mediator analyses (correlations between percentage of math standards covered and student math achievement) and assess fidelity of classroom implementation.

You can download the report and research summary from the study using the links below.
Math in Focus research report
Math in Focus research summary