blog posts and news stories

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research in Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings where no positive effect was found and over-report large, but potentially spurious results. Replication of a result over the long run helps us to get past the biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge to conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted given demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions, from bias in results that John Ioannidis considered. In other fields (e.g., the “hard sciences”), less context dependency and more-robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this may be more challenging and reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind—an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard, it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials where heterogeneity is built into the design by comparing different versions of a program or implementing in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: what are the attributes of participants and settings across which we expect effects to vary and why? Under which conditions and how do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach whereby differential effects of program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.


Factor Analysis Shows Facets of Teaching

The Empirical team has illustrated quantitatively what a lot of people have suspected. Basic classroom management, keeping things moving along, and the sense that the teacher is in control are most closely associated with achievement gains. We used teacher evaluation data collected by the Measures of Effective Teaching project to develop a three-factor model and found that only one factor was associated with VAM scores. Two other factors—one associated with constructivist pedagogy and the other with positive attitudes—were unrelated to short-term student outcomes. Val Lazarev and Denis Newman presented this work at the Association for Education Finance and Policy Annual Conference on March 13, 2014. And on May 7, Denis Newman and Kristen Koue conducted a workshop on the topic at the CCSSO’s SCEE Summit. The workshop emphasized the way that factors not directly associated with spring test scores can be very important in personnel decisions. The validation of these other factors may require connections to student achievements such as staying in school, getting into college, or pursuing a STEM career in years after the teacher’s direct contact.


Empirical Education Presents Initial Results from i3 Validation Grant Evaluation

Our director of research operations, Jenna Zacamy, joined Cheri Fancsali from IMPAQ International and Cyndy Greenleaf from the Strategic Literacy Initiative (SLI) at WestEd at the Literacy Research Association (LRA) conference in Dallas, TX on December 4. Together, they conducted a symposium, which was the first formal presentation of findings from the Investing in Innovation (i3) Validation grant, Reading Apprenticeship Improving Secondary Education (RAISE). WestEd won the grant in 2010 with Empirical Education and IMPAQ serving as the evaluators. There are two ongoing evaluations: the first includes a randomized control trial (RCT) of over 40 schools in California and Pennsylvania investigating the impact of Reading Apprenticeship on teacher instructional practices and student achievement; the second is a formative evaluation spanning four states and 150+ schools investigating how the school systems build capacity to implement and disseminate Reading Apprenticeship and sustain these efforts. The symposium’s discussant, P. David Pearson (UC Berkeley), provided praise of the design and effort of both studies stating that he has “never seen such thoughtful and extensive evaluations.” Preliminary findings from the RCT show that Reading Apprenticeship teachers provide students more opportunities to practice metacognitive strategies and foster and support more student collaboration opportunities. Findings from the second year of the formative evaluation suggest high levels of buy-in and commitment from both school administrators and teachers, but also identify competing initiatives and priorities as a primary challenge to sustainability. Initial findings of our five-year, multi-state study of RAISE are promising, but reflect the real-world complexity of scaling up and evaluating literacy initiatives across several contexts. Final results from both studies will be available in 2015.

View the information presented at LRA here and here.


Empirical Presents about Aspire Public School’s t3 System at AEA 2013

Empirical Education presented at the annual conference of the American Evaluation Association (AEA) in Washington, DC. Our newest research manager, Kristen Koue, along with our chief scientist, Andrew Jaciw reflected on striking the right balance between conducting a rigorous randomized control trial that meets i3 grant parameters, while also conducting an implementation evaluation that provides useful formative feedback to the Aspire population.


State Reports Show Almost All Teachers Are Effective or Highly So. Is This Good News?

The New York Times recently picked up a story, originally reported in Education Week two months ago, that school systems using formal methods for classroom observation as part of their educator evaluations are giving all but a very small percent of teachers high ratings—a phenomenon commonly known as the “widget effect.” The Times quotes Russ Whitehurst as suggesting that “It would be an unusual profession that at least 5 percent are not deemed ineffective.”

Responding to the story in her blog, Diane Ravitch calls it “unintentionally hilarious,” portraying the so-called reformers as upset that their own expensive evaluation methods are finding that most teachers are good at what they do. In closing, she asks, “Where did all those ineffective teachers go?”

We’re a research company working actively on teacher evaluation, so we’re interested in these kinds of questions. Should state-of-the-art observation protocols have found more teachers in the “needs improvement” category or at least 5% labeled “ineffective”? We present here an informal analysis meant to get an approximate answer, but based on data that was collected in a very rigorous manner. As one of the partners in the Gates Foundation’s Measures of Effective Teaching (MET) project, Empirical Education has access to a large dataset available for this examination, including videotaped lessons for almost 2,000 teachers coded according to a number of popular observational frameworks. Since the MET raters were trained intensively using methods approved by the protocol developers and had no acquaintance or supervisory relationship with the teachers in the videos, there is reason to think that the results show the kind of distribution intended by the developers of the observation methods. We can then compare the results in this controlled environment to the results referred to in the EdWeek and Times articles, which were based on reporting by state agencies. We used a simple (but reasonable) way of calculating the distribution of teachers in the MET data according to the categories in one popular protocol and compared it to the results reported by one of the states for a district known to have trained principals and other observers in the same protocol. We show the results here. The light bars show the distribution of the ratings in the MET data. We can see that a small percentage are rated “highly effective” and an equally small percentage “unsatisfactory.” So although the number doesn’t come up to the percent suggested by Russ Whitehurst, this well-developed method finds only 2% of a large sample of teachers to be in the bottom category. About 63% are considered “effective”, while a third are given a “needs improvement” rating. The dark bars are the ratings given by the school district using the same protocol. This shows a distribution typical of what EdWeek and the Times reported, where 97% are rated as “highly effective” or “effective.” It is interesting that the school district and MET research both found a very small percentage of unsatisfactory teachers.

Where we find a big difference is in the fact that the research program deemed only a small number of teachers to be exceptional while the school system used that category much more liberally. The other major difference is in the “needs improvement” category. When the observational protocol is used as designed, a solid number of teachers are viewed as doing OK but potentially doing much better. Both in research and in practice, the observational protocol divides most teachers between two categories. In the research setting, the distinction is between teachers who are effective and those who need improvement. In practice, users of the same protocol distinguish between effective and highly effective teachers. Both identify a small percent as unsatisfactory.

Our analysis suggests two problems with the use of the protocol in practice: first, the process does not provide feedback to teachers who are developing their skills, and, second, it does not distinguish between very good teachers and truly exceptional ones. We can imagine all sorts of practical pressures that, for the evaluators (principals, coaches and other administrators) decrease the value of identifying teachers who are less than fully effective and can benefit from developing specific skills. For example, unless all the evaluators in a district simultaneously agree to implement more stringent evaluations, then teachers in the schools where such evaluations are implemented will be disadvantaged. It will help to also have consistent training and calibration for the evaluators as well as accountability, which can be done with a fairly straightforward examination of the distribution of ratings.

Although this was a very informal analysis with a number of areas where we approximated results, we think we can conclude that Russ Whitehurst probably overstated the estimate of ineffective teachers but Diane Ravitch probably understated the estimate of teachers who could use some help and guidance in getting better at what they do.

Postscript. Because we are researchers and not committed to the validity of the observational methods, we need to state that we don’t know the extent to which the teachers labeled ineffective are generally less capable of raising student achievement. But researchers are notorious for ending all our reports with “more research is needed!”


Study Shows a “Singapore Math” Curriculum Can Improve Student Problem Solving Skills

A study of HMH Math in Focus (MIF) released today by research firm Empirical Education Inc. demonstrates a positive impact of the curriculum on Clark County School District elementary students’ math problem solving skills. The 2011-2012 study was contracted by the publisher, which left the design, conduct, and reporting to Empirical. MIF provides elementary math instruction based on the pedagogical approach used in Singapore. The MIF approach to instruction is designed to support conceptual understanding, and is said to be closely aligned with the Common Core State Standards (CCSS), which focuses more on in-depth learning than previous math standards.

Empirical found an increase in math problem solving among students taught with HMH Math in Focus compared to their peers. The Clark County School District teachers also reported an increase in their students’ conceptual understanding, as well as an increase in student confidence and engagement while explaining and solving math problems. The study addressed the difference between the CCSS-oriented MIF and the existing Nevada math standards and content. While MIF students performed comparatively better on complex problem solving skills, researchers found that students in the MIF group performed no better than the students in the control group on the measure of math procedures and computation skills. There was also no significant difference between the groups on the state CRT assessment, which has not fully shifted over to the CCSS.

The research used a group randomized control trial to examine the performance of students in grades 3-5 during the 2011-2012 school year. Each grade-level team was randomly assigned to either the treatment group that used MIF or the control group that used the conventional math curriculum. Researchers used three different assessments to capture math achievement contrasting procedural and problem solving skills. Additionally, the research design employed teacher survey data to conduct mediator analyses (correlations between percentage of math standards covered and student math achievement) and assess fidelity of classroom implementation.

You can download the report and research summary from the study using the links below.
Math in Focus research report
Math in Focus research summary


Empirical Starts on a 3rd Investing in Innovation (i3) Evaluation

This week was the kickoff meeting in Oakland, CA for a multi-year evaluation of WestEd’s iRAISE project, a grant to develop an online training system for their Reading Apprenticeship framework. iRAISE stands for Internet-based Reading Apprenticeship Improving Science Education. Being developed by WestEd’s Strategic Literacy Initiative (SLI), a prominent R&D group in this domain, iRAISE will provide a 65-hour online version of their conventional face-to-face professional development for high school science teachers. We are also contracted for the evaluation of the validation-level i3 grant to WestEd for a scaling up of Reading Apprenticeship, a project that received the third highest score in that year’s i3 competition. Additionally, Empirical is conducting the evaluation of Aspire Public Schools development grant in 2011. In this case we are evaluating their teacher effectiveness technology tools.

Further information on our capabilities working with i3 grants is located here.


Importance is Important for Rules of Evidence Proposed for ED Grant Programs

The U.S. Department of Education recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program where the proposal must show there’s some evidence that the proposed innovation has a chance of working and scaling and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. And each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules definitely had some weak points that were pointed out in the public comments, which are available online. We hope ED heeds these suggestions.

Mark Schneiderman representing the Software and Information Industry Association (SIIA) recommends that outcomes used in effectiveness studies should not be limited to achievement scores.

SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.

He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.

SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.

This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment the size of the impact for any other context cannot be anticipated. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.

Jon Baron, writing for the Coalition for Evidence Based Policy expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that

“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”

He cites examples where researchers have reported statistically significant results, which were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study—policy makers must consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.

Addressing the substance and importance of the results gets us appropriately into questions of external validity, and leads us to questions about subgroup impact, where, for example, an innovation has a positive impact “on average” and works well for high scoring students but provides no value for low scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.


We Turned 10!

Happy birthday to us, happy birthday to us, happy birthday to Empirical Education, happy birthday to us!

This month we turn 10 years old! We can’t think of a better way to celebrate than with all of our friends at a birthday party at AERA next month.

If you aren’t able to attend our birthday party, we’ll also be presenting at SREE this week and at AERA next month.

Research Topics will include:

We look forward to seeing you at our sessions to discuss our research.

Pictures from the party are on our facebook page, but here’s a sneak peek.


Does 1 teacher = 1 number? Some Questions About the Research on Composite Measures of Teacher Effectiveness

We are all familiar with approaches to combining student growth metrics and other measures to generate a single measure that can be used to rate teachers for the purpose of personnel decisions. For example, as an alternative to using seniority as the basis for reducing the workforce, a school system may want to base such decisions—at least in part—on a ranking based on a number of measures of teacher effectiveness. One of the reports released January 8 by the Measures of Effective Teaching (MET) addressed approaches to creating a composite (i.e., a single number that averages various aspects of teacher performance) from multiple measures such as value-added modeling (VAM) scores, student surveys, and classroom observations. Working with the thousands of data points in the MET longitudinal database, the researchers were able to try out multiple statistical approaches to combining measures. The important recommendation from this research for practitioners is that, while there is no single best way to weight the various measures that are combined in the composite, balancing the weights more evenly tends to increase reliability.

While acknowledging the value of these analyses, we want to take a step back in this commentary. Here we ask whether agencies may sometimes be jumping to the conclusion that a composite is necessary when the individual measures (and even the components of these measures) may have greater utility than the composite for many purposes.

The basic premise behind creating a composite measure is the idea that there is an underlying characteristic that the composite can more or less accurately reflect. The criterion for a good composite is the extent to which the result accurately identifies a stable characteristic of the teacher’s effectiveness.

A problem with this basic premise is that in focusing on the common factor, the aspects of each measure that are unrelated to the common factor get left out—treated as noise in the statistical equation. But, what if observations and student surveys measure things that are unrelated to what the teacher’s students are able to achieve in a single year under her tutelage (the basis for a VAM score)? What if there are distinct domains of teacher expertise that have little relation to VAM scores? By definition, the multifaceted nature of teaching gets reduced to a single value in the composite.

This single value does have a use in decisions that require an unequivocal ranking of teachers, such as some personnel decisions. For most purposes, however, a multifaceted set of measures would be more useful. The single measure has little value for directing professional development, whereas the detailed output of the observation protocols are designed for just that. Consider a principal deciding which teachers to assign as mentors, or a district administrator deciding which teachers to move toward a principalship. Might it be useful, in such cases, to have several characteristics to represent different dimensions of abilities relevant to success in the particular roles?

Instead of collapsing the multitude of data points from achievement, surveys, and observations, consider an approach that makes maximum use of the data points to identify several distinct characteristics. In the usual method for constructing a composite (and in the MET research), the results for each measure (e.g., the survey or observation protocol) are first collapsed into a single number, and then these values are combined into the composite. This approach already obscures a large amount of information. The Tripod student survey provides scores on the seven Cs; an observation framework may have a dozen characteristics; and even VAM scores, usually thought of as a summary number, can be broken down (with some statistical limitations) into success with low-scoring vs. with high-scoring students (or any other demographic category of interest). Analyzing dozens of these data points for each teacher can potentially identify several distinct facets of a teacher’s overall ability. Not all facets will be strongly correlated with VAM scores but may be related to the teacher’s ability to inspire students in subsequent years to take more challenging courses, stay in school, and engage parents in ways that show up years later.

Creating a single composite measure of teaching has value for a range of administrative decisions. However, the mass of teacher data now being collected are only beginning to be tapped for improving teaching and developing schools as learning organizations.