Blog Posts and News Stories

IES Releases New Empirical Education Report on Educator Effectiveness

Just released by IES, our report examines the statistical properties of Arizona’s new multiple-measure teacher evaluation model. The study used data from the pilot in 2012-13 to explore the relationships among the system’s component measures (teacher observations, student academic progress, and stakeholder surveys). It also investigated how well the model differentiated between higher and lower performing teachers. Findings suggest that the model’s observation measure may be improved through further evaluator training and calibration, and that a single aggregated composite score may not adequately represent independent aspects of teacher performance.
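To make the kind of question the report addresses concrete, here is a minimal sketch of how component measures and a weighted composite might be examined. The column names, weights, and simulated data are illustrative assumptions, not Arizona's actual model or data.

```python
# Illustrative sketch only -- not the report's actual analysis or data.
import numpy as np
import pandas as pd

def composite_and_correlations(scores: pd.DataFrame,
                               weights=(0.50, 0.33, 0.17)):
    """Return pairwise correlations among the component measures and a
    weighted composite score for each teacher."""
    components = ["observation", "student_growth", "survey"]
    corr = scores[components].corr()  # low correlations suggest the measures
                                      # capture distinct aspects of performance
    composite = scores[components].mul(list(weights)).sum(axis=1)
    return corr, composite

# Example with simulated scores standing in for pilot data:
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "observation": rng.normal(3.0, 0.5, 200),     # rubric-based rating
    "student_growth": rng.normal(0.0, 1.0, 200),  # growth-model score
    "survey": rng.normal(4.0, 0.6, 200),          # stakeholder survey score
})
corr, composite = composite_and_correlations(df)
print(corr.round(2))
```

Low correlations among the components would be consistent with the finding that a single aggregated composite can mask independent aspects of teacher performance.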

The study was carried out in partnership with the Arizona Department of Education as part of our work with the Regional Educational Laboratory (REL) West’s Educator Effectiveness Alliance, which includes officials from the Arizona, Utah, and Nevada departments of education, as well as teacher union representatives, district administrators, and policymakers. While the analysis is specific to Arizona’s model, the study’s findings and methodology may be of interest to other state education agencies that are developing or implementing new multiple-measure evaluation systems. We have continued this work with additional analyses for alliance members and plan to provide further capacity building during 2015.

2014-12-16

Empirical to Evaluate Two More Winning i3 Projects

The U.S. Department of Education has announced the highest-rated applicants for the 2014 Investing in Innovation (i3) competition. Of the 434 submissions received by ED, we are excited that both of the proposals for which we developed evaluation plans were among the 26 winners. Both were the highest-rated applications in their respective categories!

In one, we’ll be partnering with WestEd to evaluate the Making Sense of Science and Literacy program. Written as a validation proposal, this 5-year project will aim to strengthen teachers’ content knowledge, transform classroom practices, and boost student achievement.

The other highest-rated application was a development proposal submitted by the Atlanta Neighborhood Charter Schools. In this 5-year project, we will assess the impact of its 3-year residency model on the effectiveness of early-career teachers.

Both projects were submitted under the competition’s teacher effectiveness priority. We have a long-standing partnership with WestEd on i3 evaluations and Regional Lab projects. This is our first project with Atlanta Neighborhood Charter Schools, and it builds on our educator effectiveness work and our ongoing partnerships with charter schools, including our evaluation of an i3 Development effort by Aspire Public Schools.

For more information on our evaluation services and our work on i3 projects, please visit our i3 page or contact Robin Means.

2014-11-07

U.S. Department of Education Could Expand Its Concept of Student Growth

The continuing debate about the use of student test scores as a part of teacher evaluation misses an essential point. A teacher’s influence on a student’s achievement does not end in spring when the student takes the state test (or is evaluated using any of the Student Learning Objectives methods). An inspiring teacher, or one who makes a student feel recognized, or one who digs a bit deeper into the subject matter, may be part of the reason that the student later graduates from high school, gets into college, or pursues a STEM career. These are “student achievements,” but they are ones that show up years after the teacher had the student in her class. As a teacher is getting students to grapple with a new concept, the students may not demonstrate improvements on standardized tests that year. But the “value-added” by the teacher may show up in later years.

States and districts implementing educator evaluations as part of their NCLB waivers are well aware of the requirement that they must “use multiple valid measures in determining performance levels, including as a significant factor data on student growth …” Student growth is defined as the change in achievement on assessments between two points in time. Defining student growth in this way obscures a teacher’s contribution to a student’s later school career.

As a practical matter, it may seem obvious that for this year’s evaluation, we can’t use something that happens next year. But recent analyses of longitudinal data, reviewed in an excellent piece by Raudenbush, show that it is possible to identify predictors of later student achievement associated with individual teacher practices and effectiveness. The widespread implementation of multiple-measure teacher evaluations is starting to accumulate just the longitudinal datasets needed for these predictive analyses. On the basis of these analyses, we may be able to validate many of the facets of teaching that we have found, in analyses of the MET data, to be unrelated to student growth as defined in the waiver requirements.

If we can identify, through classroom observations and surveys, practices and dispositions that are predictive of later student achievements such as college going, then we have validated those practices. Ultimately, we may be able to substitute classroom observations and surveys of students, peers, and parents for value-added modeling based on state tests and other ad hoc measures of student growth. We are not yet at that point, but the first step will be to recognize that a teacher’s influence on a student’s growth extends beyond the year she has the student in the class.
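As a rough illustration of the kind of predictive analysis described above, the sketch below fits a logistic regression of a later outcome, such as college enrollment, on observation and survey measures while controlling for prior achievement. The dataset and variable names are hypothetical; this is not the Raudenbush analysis or any particular state’s data.

```python
# A minimal sketch, assuming a hypothetical longitudinal file that links
# teachers' observation and survey scores to students' later outcomes.
import pandas as pd
import statsmodels.formula.api as smf

def predict_later_outcome(df: pd.DataFrame):
    """Fit a logistic regression of a later student outcome (e.g., college
    enrollment) on teacher observation and survey measures, controlling for
    prior achievement."""
    model = smf.logit(
        "enrolled_college ~ observation_score + student_survey + prior_achievement",
        data=df,
    )
    return model.fit()

# Usage (df would come from a longitudinal data system):
# result = predict_later_outcome(df)
# print(result.summary())
```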

2014-08-30

Empirical Education Helps North Carolina to Train and Calibrate School Leaders in the North Carolina Educator Evaluation System

Empirical Education, working with its partner BloomBoard, is providing calibration and training services for school administrators across the state of North Carolina. Use of Observation Engine began in June with a pilot of the integrated solution; once fully deployed, it will be available to all 115 districts in the state, reaching more than 6,000 school leaders and potentially 120,000 teachers in the process.

The partnership with BloomBoard gives users an easy-to-use, integrated platform and gives North Carolina Department of Public Instruction (NCDPI) a comprehensive online training and calibration solution for school administrators who will be evaluating teachers as part of the North Carolina Educator Evaluation System (NCEES). The platform will combine Empirical’s state-of-the-art observer training and calibration tool, Observation Engine, with BloomBoard’s Professional Development Marketplace.

NCDPI Director of Educator Effectiveness, Lynne Johnson, is excited about the potential of the initiative. “The BloomBoard-Empirical partnership is an innovative new approach that will help change the way our state personalizes the training, professional development, and calibration of our educators,” says Johnson. “We look forward to working with partners that continue to change the future of U.S. education.”

Read the press release.

2014-08-07

Understanding Logic Models Workshop Series

On July 17, Empirical Education facilitated the first of two workshops for practitioners in New Mexico on the development of program logic models, one of the first steps in developing a research agenda. The workshop, titled “Identifying Essential Logic Model Components, Definitions, and Formats,” introduced the general concepts, purposes, and uses of program logic models to members of the Regional Educational Laboratory (REL) Southwest’s New Mexico Achievement Gap Research Alliance. Throughout the workshop, participants collaborated with facilitators to build a logic model for a program or policy that they are working on or that is of interest to them.

Empirical Education is part of the REL Southwest team, which, through six research alliances, assists Arkansas, Louisiana, New Mexico, Oklahoma, and Texas in using data and research evidence to address high-priority regional needs, including charter school effectiveness, early childhood education, Hispanic achievement in STEM, rural school performance, and closing the achievement gap. The logic model workshops aim to strengthen the technical capacity of New Mexico Achievement Gap Research Alliance members to understand and visually represent their programs’ theories of change, identify key program components and outcomes, and use logic models to develop research questions. Both workshops are being held in Albuquerque, New Mexico.

2014-06-17

Getting Different Results from the Same Program in Different Contexts

The spring 2014 conference of the Society for Research in Educational Effectiveness (SREE) gave us much food for thought concerning the role of replication of experimental results in social science research. If two research teams get the same result from experiments on the same program, that gives us confidence that the original result was not a fluke or somehow biased.

But in his keynote, John Ioannidis of Stanford showed that even in medical research, where the context can be more tightly controlled, replication very often fails—researchers get different results. The original finding may have been biased, for example, through the tendency to suppress null findings, in which no positive effect was found, and to over-report large but potentially spurious results. Replication of a result over the long run helps us get past these biases. Though not as glamorous as discovery, replication is fundamental to science, and educational science is no exception.

In the course of the conference, I was reminded that the challenge of conducting replication work is, in a sense, compounded in social science research. “Effect heterogeneity”—finding different results in different contexts—is common for many legitimate reasons. For instance, experimental controls seldom get placebos. They receive the program already in place, often referred to as “business as usual,” and this can vary across experiments of the same intervention and contribute to different results. Also, experiments of the same program carried out in different contexts are likely to be adapted to the demands or affordances of the situation, and flexible implementation may lead to different results. The challenge is to disentangle differences in effects that give insight into how programs are adapted in response to conditions from the kind of bias in results that John Ioannidis described. In other fields (e.g., the “hard sciences”), less context dependency and more robust effects may make it easier to diagnose when variation in findings is illegitimate. In education, this is more challenging, and it reminds me why educational research is in many ways the ‘hardest science’ of all, as David Berliner has emphasized in the past.

Once separated from the distortions of bias and properly differentiated from the usual kind of “noise” or random error, differences in effects can actually be leveraged to better understand how and for whom programs work. Building systematic differences in conditions into our research designs can be revealing. Such efforts should, however, be considered with the role of replication in mind—an approach to research that purposively builds in heterogeneity, in a sense, seeks to find where impacts don’t replicate, but for good reason. Non-reproducibility in this case is not haphazard; it is purposive.

What are some approaches to leveraging and understanding effect heterogeneity? We envision randomized trials in which heterogeneity is built into the design by comparing different versions of a program or by implementing it in diverse settings across which program effects are hypothesized to vary. A planning phase of an RCT would allow discussions with experts and stakeholders about potential drivers of heterogeneity. Pertinent questions to address during this period include: What are the attributes of participants and settings across which we expect effects to vary, and why? Under which conditions, and how, do we expect program implementation to change? Hypothesizing which factors will moderate effects before the experiment is conducted would add credibility to the results if they corroborate the theory. A thoughtful approach of this sort can be contrasted with the usual approach, whereby differential effects of a program are explored as afterthoughts, with the results carrying little weight.

Building in conditions for understanding effect heterogeneity will have implications for experimental design. Increasing variation in outcomes affects statistical power and the sensitivity of designs to detect effects. We will need a better understanding of the parameters affecting precision of estimates. At Empirical, we have started using results from several of our experiments to explore parameters affecting sensitivity of tests for detecting differential impact. For example, we have been documenting the variation across schools in differences in performance depending on student characteristics such as individual SES, gender, and LEP status. This variation determines how precisely we are able to estimate the average difference between student subgroups in the impact of a program.
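As a concrete example of the kind of differential-impact analysis described above, the sketch below estimates a treatment-by-subgroup interaction in a mixed-effects model with a random intercept for schools. The dataset and variable names (treatment, lep, pretest, posttest, school) are hypothetical assumptions for illustration, not a specific Empirical Education analysis.

```python
# A sketch of estimating differential impact, assuming a hypothetical
# student-level dataset nested within schools.
import pandas as pd
import statsmodels.formula.api as smf

def differential_impact(df: pd.DataFrame):
    """Estimate how a program's impact differs by LEP status, with a random
    intercept for schools."""
    model = smf.mixedlm(
        "posttest ~ pretest + treatment * lep",  # treatment:lep term is the differential impact
        data=df,
        groups=df["school"],                     # random intercept per school
    )
    return model.fit()

# The precision of the treatment-by-LEP estimate depends in part on how much
# the LEP/non-LEP performance gap varies across schools -- one of the
# sensitivity parameters discussed above.
```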

Some may feel that introducing heterogeneity to better understand conditions for observing program effects is going down a slippery slope. Their thinking is that it is better to focus on program impacts averaged across the study population and to replicate those effects across conditions; and that building sources of variation into the design may lead to loose interpretations and loss of rigor in design and analysis. We appreciate the cautionary element of this position. However, we believe that a systematic study of how a program interacts with conditions can be done in a disciplined way without giving up the usual strategies for ensuring the validity of results.

We are excited about the possibility that education research is entering a period of disciplined scientific inquiry to better understand how differences in students, contexts, and programs interact, with the hope that the resulting work will lead to greater opportunity and better fit of program solutions to individuals.

2014-05-21

Factor Analysis Shows Facets of Teaching

The Empirical team has illustrated quantitatively what a lot of people have suspected. Basic classroom management, keeping things moving along, and the sense that the teacher is in control are most closely associated with achievement gains. We used teacher evaluation data collected by the Measures of Effective Teaching project to develop a three-factor model and found that only one factor was associated with VAM scores. Two other factors—one associated with constructivist pedagogy and the other with positive attitudes—were unrelated to short-term student outcomes. Val Lazarev and Denis Newman presented this work at the Association for Education Finance and Policy Annual Conference on March 13, 2014. And on May 7, Denis Newman and Kristen Koue conducted a workshop on the topic at the CCSSO’s SCEE Summit. The workshop emphasized the way that factors not directly associated with spring test scores can be very important in personnel decisions. The validation of these other factors may require connections to student achievements such as staying in school, getting into college, or pursuing a STEM career in years after the teacher’s direct contact.
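For readers interested in the general approach, here is a minimal sketch of extracting factors from observation-rubric item scores and checking their association with value-added scores. The data, variable names, and rotation choice are assumptions for illustration; this is not the MET dataset or the exact model presented.

```python
# Illustrative sketch only -- hypothetical rubric items and VAM scores.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def three_factor_check(items: pd.DataFrame, vam: pd.Series) -> pd.Series:
    """Extract three factors from observation-rubric item scores and report
    each factor's correlation with teachers' VAM scores."""
    fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
    factor_scores = fa.fit_transform(items)  # one row of factor scores per teacher
    return pd.Series(
        [np.corrcoef(factor_scores[:, k], vam)[0, 1] for k in range(3)],
        index=["factor_1", "factor_2", "factor_3"],
    )

# Usage: three_factor_check(rubric_items_df, vam_scores_series)
```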

2014-05-09