Blog Posts and News Stories

Pittsburgh Public Schools Uses 20 New Content Suite Videos


In June 2015, Pittsburgh Public Schools began using Observation Engine to calibrate and train its teacher evaluators. The district was one of our first clients to use the Content Suite to calibrate and certify classroom observers.

The Content Suite contains a collection of master-scored videos along with thoughtful, objective score justifications for all observable elements of teaching called for by the evaluation framework used in the district. It also includes short video clips, each focused on one particular aspect of teaching. The combination of full-length videos and short clips makes it easy and flexible to set up practice exercises, collaborative calibration sessions, and formal certification testing. Recently, we added 20 new videos to the Content Suite collection!

The Content Suite can be used with most frameworks, either as-is or modified to ensure that the scores and justifications are consistent with the local context and the district's interpretation of its observation framework. Observation Engine support staff work closely with each client to modify content and design a customized implementation plan that meets the goals of the school system and sets up evaluators for success. For more information about the Content Suite, click here.

2016-11-10

U.S. Department of Education Could Expand its Concept of Student Growth

The continuing debate about the use of student test scores as a part of teacher evaluation misses an essential point. A teacher’s influence on a student’s achievement does not end in the spring when the student takes the state test (or is evaluated using any of the Student Learning Objectives methods). An inspiring teacher, or one who makes a student feel recognized, or one who digs a bit deeper into the subject matter, may be part of the reason that the student later graduates from high school, gets into college, or pursues a STEM career. These are “student achievements,” but they show up years after the teacher had the student in her class. While a teacher is getting students to grapple with a new concept, the students may not demonstrate improvements on standardized tests that year. But the “value added” by the teacher may show up in later years.

States and districts implementing educator evaluations as part of their NCLB waivers are very aware of the requirement that they must “use multiple valid measures in determining performance levels, including as a significant factor data on student growth …” Student growth is defined as the change in achievement on assessments between two points in time. Defined this way, student growth obscures a teacher’s contribution to a student’s later school career.

As a practical matter, it may seem obvious that for this year’s evaluation, we can’t use something that happens next year. But recent analyses of longitudinal data, reviewed in an excellent piece by Raudenbush, show that it is possible to identify predictors of later student achievement associated with individual teacher practices and effectiveness. The widespread implementation of multiple-measure teacher evaluations is starting to accumulate just the longitudinal datasets needed to do these predictive analyses. On the basis of these analyses, we may be able to validate many of the facets of teaching that we have found, in analyses of the MET data, to be unrelated to student growth as defined in the waiver requirements.

Insofar as we can identify, through classroom observations and surveys, practices and dispositions that are predictive of later student achievements such as college going, we will have validated those practices. Ultimately, we may be able to substitute classroom observations and surveys of students, peers, and parents for value-added modeling based on state tests and other ad hoc measures of student growth. We are not yet at that point, but the first step is to recognize that a teacher’s influence on a student’s growth extends beyond the year she has the student in class.

2014-08-30

Understanding Logic Models Workshop Series

On July 17, Empirical Education facilitated the first of two workshops for practitioners in New Mexico on the development of program logic models, one of the first steps in developing a research agenda. The workshop, entitled “Identifying Essential Logic Model Components, Definitions, and Formats”, introduced the general concepts, purposes, and uses of program logic models to members of the Regional Education Lab (REL) Southwest’s New Mexico Achievement Gap Research Alliance. Throughout the workshop, participants collaborated with facilitators to build a logic model for a program or policy that participants are working on or that is of interest.

Empirical Education is part of the REL Southwest team, which assists Arkansas, Louisiana, New Mexico, Oklahoma, and Texas in using data and research evidence to address high-priority regional needs, including charter school effectiveness, early childhood education, Hispanic achievement in STEM, rural school performance, and closing the achievement gap, through six research alliances. The logic model workshops aim to strengthen the technical capacity of New Mexico Achievement Gap Research Alliance members to understand and visually represent their programs’ theories of change, identify key program components and outcomes, and use logic models to develop research questions. Both workshops are being held in Albuquerque, New Mexico.

2014-06-17

Factor Analysis Shows Facets of Teaching

The Empirical team has illustrated quantitatively what a lot of people have suspected. Basic classroom management, keeping things moving along, and the sense that the teacher is in control are most closely associated with achievement gains. We used teacher evaluation data collected by the Measures of Effective Teaching project to develop a three-factor model and found that only one factor was associated with VAM scores. Two other factors—one associated with constructivist pedagogy and the other with positive attitudes—were unrelated to short-term student outcomes. Val Lazarev and Denis Newman presented this work at the Association for Education Finance and Policy Annual Conference on March 13, 2014. And on May 7, Denis Newman and Kristen Koue conducted a workshop on the topic at the CCSSO’s SCEE Summit. The workshop emphasized the way that factors not directly associated with spring test scores can be very important in personnel decisions. The validation of these other factors may require connections to student achievements such as staying in school, getting into college, or pursuing a STEM career in years after the teacher’s direct contact.

2014-05-09

Importance is Important for Rules of Evidence Proposed for ED Grant Programs

The U.S. Department of Education recently proposed new rules for including serious evaluations as part of its grant programs. The approach is modeled on how evaluations are used in the Investing in Innovation (i3) program where the proposal must show there’s some evidence that the proposed innovation has a chance of working and scaling and must include an evaluation that will add to a growing body of evidence about the innovation. We like this approach because it treats previous research as a hypothesis that the innovation may work in the new context. And each new grant is an opportunity to try the innovation in a new context, with improved approaches that warrant another check on effectiveness. But the proposed rules definitely had some weak points that were pointed out in the public comments, which are available online. We hope ED heeds these suggestions.

Mark Schneiderman, representing the Software and Information Industry Association (SIIA), recommends that the outcomes used in effectiveness studies not be limited to achievement scores.

SIIA notes that grant program resources could appropriately address a range of purposes from instructional to administrative, from assessment to professional development, and from data warehousing to systems productivity. The measures could therefore include such outcomes as student test scores, teacher retention rates, changes in classroom practice or efficiency, availability and use of data or other student/teacher/school outcomes, and cost effectiveness and efficiency that can be observed and measured. Many of these outcome measures can also be viewed as intermediate outcomes—changes in practice that, as demonstrated by other research, are likely to affect other final outcomes.

He also points out that quality of implementation and the nature of the comparison group can be the deciding factors in whether or not a program is found to be effective.

SIIA notes that in education there is seldom a pure control condition such as can be achieved in a medical trial with a placebo or sugar pill. Evaluations of education products and services resemble comparative effectiveness trials in which a new medication is tested against a currently approved one to determine whether it is significantly better. The same product may therefore prove effective in one district that currently has a weak program but relatively less effective in another where a strong program is in place. As a result, significant effects can often be difficult to discern.

This point gets to the heart of the contextual issues in any experimental evaluation. Without understanding the local conditions of the experiment, the size of the impact in any other context cannot be anticipated. Some experimentalists would argue that a massive multi-site trial would allow averaging across many contextual variations. But such “on average” results won’t necessarily help the decision-maker working in specific local conditions. Thus, taking previous results as a rough indication that an innovation is worth trying is the first step before conducting the grant-funded evaluation of a new variation of the innovation under new conditions.

Jon Baron, writing for the Coalition for Evidence-Based Policy, expresses a fundamental concern about what counts as evidence. Jon, who is a former Chair of the National Board for Education Sciences and has been a prominent advocate for basing policy on rigorous research, suggests that

“the definition of ‘strong evidence of effectiveness’ in §77.1 incorporate the Investing in Innovation Fund’s (i3) requirement for effects that are ‘substantial and important’ and not just statistically significant.”

He cites examples where researchers have reported statistically significant results, which were based on trivial outcomes or had impacts so small as to have no practical value. Including “substantial and important” as additional criteria also captures the SIIA’s point that it is not sufficient to consider the internal validity of the study—policy makers must consider whether the measure used is an important one or whether the treatment-control contrast allows for detecting a substantial impact.

Addressing the substance and importance of the results gets us appropriately into questions of external validity, and leads us to questions about subgroup impact, where, for example, an innovation has a positive impact “on average” and works well for high-scoring students but provides no value for low-scoring students. We would argue that a positive average impact is not the most important part of the picture if the end result is an increase in a policy-relevant achievement gap. Should ED be providing grants for innovations where there has been a substantial indication that a gap is worsened? Probably yes, but only if the proposed development is aimed at fixing the malfunctioning innovation and if the program evaluation can address this differential impact.
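The arithmetic behind a positive average impact masking a widening gap is worth making explicit. The numbers below are purely hypothetical, chosen only to show how two equal-sized subgroups can yield a positive mean effect while the gap between them grows.

```python
# Illustrative (hypothetical) subgroup impacts, in test-score points:
# the program helps high scorers and slightly hurts low scorers.
high_gain, low_gain = 8.0, -2.0

# With equal-sized subgroups, the average impact is positive...
avg_impact = 0.5 * high_gain + 0.5 * low_gain   # 3.0 points "on average"

# ...yet the achievement gap between the subgroups widens.
gap_change = high_gain - low_gain               # gap grows by 10.0 points
```

A program evaluation that reports only `avg_impact` would call this innovation a success; reporting `gap_change` alongside it tells the policy-relevant story.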

2013-03-17

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent op-ed in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine, our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers, and their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of producing a single score for every teacher, representing teaching quality, that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching, such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see our earlier story on the Validation Engine). But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.

2012-10-31

Oklahoma Implements Empirical’s Observation Engine for Certification of Classroom Observers

Tulsa Public Schools, the Cooperative Council for Oklahoma School Administration, and Empirical Education Inc. just announced the launch of Observation Engine to implement the Teacher and Leader Effectiveness program in the state of Oklahoma. Tulsa Public Schools has purchased Empirical Education’s Observation Engine, an online certification and calibration tool for measuring the reliability of administrators assigned to conduct classroom observations. Tulsa Public Schools developed the Tulsa Model for Observation and Evaluation, a framework for measuring teaching effectiveness as well as best practices for creating an environment for successful learning and student achievement. Nearly 500 school districts in the state are piloting the Tulsa Model evaluation system this year.

In order to support the dissemination of the Tulsa Model, the Cooperative Council for Oklahoma Administration (CCOSA) is training and administering calibration tests throughout the state to assess and certify the individuals who evaluate the state’s teachers. The Tulsa Model is embedded in Observation Engine to deliver an efficient online system for state-wide use by Oklahoma certified classroom observers. Observation Engine is allowing CCOSA to test approximately 2,000 observers over a span of two weeks.

Observation Engine was developed as part of The Bill and Melinda Gates Foundation’s Measures of Effective Teaching project in which Empirical Education has participated as a research partner conducting R&D on validity and reliability of observational measures. The web-based software was built by Empirical Education, which hosts and supports it for school systems nationwide.

For more details on these events, see the press announcement and our case study.

2012-10-10

Study of Alabama STEM Initiative Finds Positive Impacts

On February 21, 2012 the U.S. Department of Education released the final report of an experiment that Empirical Education has been working on for the last six years. The report, titled Evaluation of the Effectiveness of the Alabama Math, Science, and Technology Initiative (AMSTI) is now available on the Institute of Education Sciences website. The Alabama State Department of Education held a press conference to announce the findings, attended by Superintendent of Education Bice, staff of AMSTI, along with educators, students, and co-principal investigator of the study, Denis Newman, CEO of Empirical Education. The press release issued by the Alabama State Department of Education and a WebEx presentation provide more detail on the study’s findings.

AMSTI was developed by the state of Alabama and introduced in 2002 with the goal of improving mathematics and science achievement in the state’s K-12 schools. Empirical Education was primarily responsible for conducting the study—including the design, data collection, analysis, and reporting—under its subcontract with the Regional Education Lab, Southeast (the study was initiated through a research grant to Empirical). Researchers from the Academy for Educational Development, Abt Associates, and ANALYTICA made important contributions to the design, analysis, and data collection.

The findings show that after one year, students in the 41 AMSTI schools experienced an impact on mathematics achievement equivalent to 28 days of additional student progress over students receiving conventional mathematics instruction. The study found no difference in science achievement after one year. It also found that AMSTI had an impact on teachers’ active learning classroom practices in math and science that, according to the theory of action posited by AMSTI, should have an impact on achievement. Further exploratory analysis found effects on student achievement in both mathematics and science after two years. The study also explored reading achievement, where it found significant differences between the AMSTI and control groups after one year. Exploration of differential effects across student demographic categories found consistent results for gender, socio-economic status, and pretest achievement level in math and science. For reading, however, the breakdown by student ethnicity suggests a differential benefit.

Just about everybody at Empirical worked on this project at one point or another. Besides the three of us (Newman, Jaciw and Zacamy) who are listed among the authors, we want to acknowledge past and current employees whose efforts made the project possible: Jessica Cabalo, Ruthie Chang, Zach Chin, Huan Cung, Dan Ho, Akiko Lipton, Boya Ma, Robin Means, Gloria Miller, Bob Smith, Laurel Sterling, Qingfeng Zhao, Xiaohui Zheng, and Margit Zsolnay.

With the solid cooperation of the state’s Department of Education and the AMSTI team, approximately 780 teachers and 30,000 upper-elementary and middle school students in 82 schools from five regions in Alabama participated in the study. The schools were randomized into one of two categories: 1) those that received AMSTI starting the first year, or 2) those that received “business as usual” the first year and began participation in AMSTI the second year. With only a one-year delay before the control group entered treatment, the two-year impact was estimated using statistical techniques developed by, and with the assistance of, our colleagues at Abt Associates. The Academy for Educational Development assisted with data collection and analysis of training and program implementation.

Findings of the AMSTI study will also be presented at the Society for Research on Educational Effectiveness (SREE) Spring Conference taking place in Washington D.C. from March 8-10, 2012. Join Denis Newman, Andrew Jaciw, and Boya Ma on Friday March 9, 2012 from 3:00pm-4:30pm, when they will present findings of their study titled, “Locating Differential Effectiveness of a STEM Initiative through Exploration of Moderators.” A symposium on the study, including the major study collaborators, will be presented at the annual conference of the American Educational Research Association (AERA) on April 15, 2012 from 2:15pm-3:45pm at the Marriott Pinnacle / Pinnacle III in Vancouver, Canada. This session will be chaired by Ludy van Broekhuizen (director of REL-SE) and will include presentations by Steve Ricks (director of AMSTI); Jean Scott (SERVE Center at UNCG); Denis Newman, Andrew Jaciw, Boya Ma, and Jenna Zacamy (Empirical Education); Steve Bell (Abt Associates); and Laura Gould (formerly of AED). Sean Reardon (Stanford) will serve as the discussant. A synopsis of the study will also be included in the Common Guidelines for Education Research and Development.

2012-02-21

Final Report Released on The Efficacy of PCI’s Reading Program

Empirical has released the final report of a three-year longitudinal study on the efficacy of the PCI Reading Program, which can be found on our reports page. This study, the first formal assessment of the PCI Reading Program, evaluated the program among a sample of third- through eighth-grade students with supported-level disabilities in Florida’s Brevard Public Schools and Miami-Dade County Public Schools. The primary goal of the study was to identify whether the program could achieve its intended purpose of teaching specific sight words. The study was completed in three “phases,” or school years. The results from Phases 1 and 2 showed a significant positive effect on students’ sight word achievement, and Phase 2 supported the initial expectation that two years of growth would be greater than one (read more on the results of Phase 1 and Phase 2).

“Working with Empirical Education was a win for us on many fronts. Their research was of the highest quality and has really helped us communicate with our customers through their several reports and conference presentations. They went beyond just outcomes to show how teachers put our reading program to use in classrooms. In all their dealings with PCI and with the school systems they were highly professional and we look forward to future research partnership opportunities.” - Lee Wilson, President & CEO, PCI Educational Publishing

In Phase 3, the remaining sample of students was too small to conduct any impact analyses, so researchers investigated patterns in students’ progress through the program. The general findings were positive: the exploration confirmed that students continue to learn more sight words with a second year of exposure to PCI, although at a slower pace than the developers expected. Furthermore, findings across all three phases show high levels of teacher satisfaction with the program. Along with this positive outcome, teacher-reported student engagement levels were also high.

2011-12-09

New Reports Show Positive Results for Elementary Reading Program

Two studies of the Treasures reading program from McGraw-Hill are now posted on our reports page. Treasures is a basal reading program for students in grades K–6. The first study was a multi-site study, while the second was conducted in the Osceola school district; both found positive impacts on reading achievement in grades 3–5.

The primary data for the first study were scores supplied, with district permission, by Northwest Evaluation Association from their MAP reading test. The study uses a quasi-experimental comparison group design based on 35 Treasures and 48 comparison schools, primarily in the Midwest. The study found that Treasures had a positive impact on overall elementary student reading scores, with the strongest effect observed for grade 5.

The second study’s data were provided by the Osceola school district and consist of demographic information, FCAT test scores, and information on student transfers during the year (between schools within the district and from other districts). The dataset for this time series design covered five consecutive school years from 2005–06 to 2009–10, including two years prior to introduction of the intervention and three years after. The study included an exploration of moderators that demonstrated a stronger positive effect for students with disabilities and English learners than for the rest of the student population. We also found a stronger positive impact on girls than on boys.

Check back for results from follow-up studies, which are currently underway in other states and districts.

2011-09-21