blog posts and news stories

IES Publishes our Recent REL Southwest Teacher Studies

The U.S. Department of Education’s Institute of Education Sciences published two reports of studies we conducted for REL Southwest! We are thankful for the support and engagement we received from the Educator Effectiveness Research Alliance and the Oklahoma Rural Schools Research Alliance throughout the studies. The collaboration with the research alliances and educators aligns well with what we set out to do in our core mission: to support K-12 systems and empower educators in making evidence-based decisions.

The first study was published earlier this month and identified factors associated with successful recruitment and retention of teachers in Oklahoma rural school districts, in order to highlight potential strategies to address Oklahoma’s teaching shortage. This correlational study covered a 10-year period (the 2005-06 to 2014-15 school years) and used data from the Oklahoma State Department of Education, the Oklahoma Office of Educational Quality and Accountability, federal non-education sources, and publicly available geographic information systems from Google Maps. The study found that teachers who are male, those who have higher postsecondary degrees, and those who have more teaching experience are harder than others to recruit and retain in Oklahoma schools. In addition, for teachers in rural districts, higher total compensation and increased responsibilities in job assignment are positively associated with successful recruitment and retention. In order to provide context, the study also examined patterns of teacher job mobility between rural and non-rural school districts. The rate of teachers in Oklahoma rural schools reaching tenure is slightly lower than the rates for teachers in non-rural areas. Also, rural school districts in Oklahoma had consistently lower rates of success in recruiting teachers than non-rural school districts from 2006-07 to 2011-12.

The second study, published last week, examined data from the 2014-15 pilot implementation of the Texas Teacher Evaluation and Support System (T-TESS). In 2014-15 the Texas Education Agency piloted the T-TESS in 57 school districts. During the pilot year, teachers’ overall ratings were based solely on rubric ratings on 16 dimensions across four domains.

The study examined the statistical properties of the T-TESS rubric to explore the extent to which it differentiates teachers on teaching quality and to investigate its internal consistency and efficiency. It also explored whether certain types of schools have teachers with higher or lower ratings. Using data from the pilot for more than 8,000 teachers, the study found that the rubric differentiates teacher effectiveness at the overall, domain, and dimension levels; domain and dimension ratings on the observation rubric are internally consistent; and the observation rubric is efficient, with each dimension making a unique contribution to a teacher’s overall rating. In addition, findings indicated that T-TESS rubric ratings varied slightly in relation to some school characteristics that were examined, such as socioeconomic status and percentage of English Language Learners. However, there is little indication that these characteristics introduced bias in the evaluators’ ratings.

2017-10-30

Sure, the edtech product is proven to work, but will it work in my district?

It’s a scenario not uncommon in your district administrators’ office. They’ve received sales pitches and demos of a slew of new education technology (edtech) products, each one accompanied by “evidence” of its general benefits for teachers and students. But underlying the administrator’s decision is a question often left unanswered: Will this work in our district?

In the conventional approach to research advocated, for example, by the U.S. Department of Education and the Every Student Succeeds Act (ESSA), the finding that is reported and used in the review of products is the overall average impact for any and all subgroups of students, teachers, or schools in the study sample. In our own research, we have repeatedly seen that whom a product works for, and under what conditions, can be more important than the average impact. There are products that are effective on average but don’t work for an important subgroup of students and, vice versa, products that work for some students but not for all. Some examples:

  • A math product, while found to be effective overall, was effective for white students but ineffective for minority students. This effect would be relevant to any district wanting to close (rather than further widen) an achievement gap.
  • A product that did well on average performed very well in elementary grades but poorly in middle school. This has obvious relevance for a district, as well as for the provider who may modify its marketing target.
  • A teacher PD product greatly benefitted uncertified teachers but didn’t help the veteran teachers do any better than their peers using the conventional textbook. This product may be useful for new teachers but a poor choice for others.
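
To make the point concrete, here is a minimal sketch of how an average impact can hide a subgroup difference. The records are invented for illustration and do not come from any of our studies, and the calculation is a simple difference in mean gains rather than the kind of adjusted impact estimate a real study would report.

```python
from statistics import mean

# Invented records for illustration only: (grade_band, used_product, score_gain).
records = [
    ("elementary", True, 12), ("elementary", True, 14),
    ("elementary", False, 6), ("elementary", False, 8),
    ("middle", True, 5), ("middle", True, 3),
    ("middle", False, 5), ("middle", False, 5),
]

def impact(rows):
    """Mean gain for product users minus mean gain for non-users."""
    users = [gain for _, used, gain in rows if used]
    non_users = [gain for _, used, gain in rows if not used]
    return mean(users) - mean(non_users)

# Average impact across everyone in the sample.
overall = impact(records)

# The same calculation repeated within each subgroup.
by_band = {
    band: impact([r for r in records if r[0] == band])
    for band in sorted({r[0] for r in records})
}

print(f"overall impact: {overall:+.1f}")
for band, value in by_band.items():
    print(f"{band} impact: {value:+.1f}")
```

With these made-up numbers the overall difference is positive (+2.5), driven by the elementary grades (+6.0), while the middle school difference is slightly negative (-1.0). A rating based only on the overall figure would not show that.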

As a research organization, we have been looking at ways to efficiently answer these kinds of questions for products. Especially now, with the evidence requirements built into ESSA, school leaders can ask the edtech salesperson: “Does your product have evidence that ESSA calls for?” They may well hear an affirmative answer supported by an executive summary of a recent study. But there’s a fundamental problem with what ESSA is asking for. ESSA doesn’t ask for evidence that the product is likely to work in your specific district. This is not the fault of ESSA’s drafters. The problem is built into the conventional design of research on “what works”. The U.S. Department of Education’s What Works Clearinghouse (WWC) bases its evidence rating only on an average; if there are different results for different subgroups of students, that difference is not part of the rating. Since ESSA adopts the WWC approach, that’s the law of the land. Hence, your district’s most pressing question is left unanswered: will this work for a district like mine?

Recently, the Software & Information Industry Association, the primary trade association of the software industry, released a set of guidelines for research explaining to its member companies the importance of working with districts to conduct research that will meet the ESSA standards. As the lead author of this report, I can say it was our goal to foster an improved dialog between the schools and the providers about the evidence that should be available to support buying these products. As an addendum to the guidelines aimed at arming educators with ways to look at the evidence and questions to ask the edtech salesperson, here are three suggestions:

  1. It is better to have some information than no information. The fact that there’s research that found the product worked somewhere gives you a working hypothesis that it could be a better than average bet to try out in your district. In this respect, you can treat the study ratings from the WWC and from newer sites such as Evidence for ESSA as a screening tool: they will point you to valid studies about the product you’re interested in. But you should treat previous research as a working hypothesis rather than proof.
  2. Look at where the research evidence was collected. You’ll want to know whether the research sites and populations in the study resemble your local conditions. WWC has gone to considerable effort to code the research by the population in the study and provides a search tool so you can find studies conducted in districts like yours. And if you download and read the original report, it may tell you whether it will help reduce or increase an achievement gap of concern.
  3. Make a deal with the salesperson. In exchange for your help in organizing a pilot and allowing them to analyze your data, you get the product for a year at a steep discount and a good ongoing price if you decide to implement the product on a full scale. While you’re unlikely to get results from a pilot (e.g., based on spring testing) in time to support a decision, you can at least lower your cost for the materials, and you’ll help provide a neighboring district (with similar populations and conditions) with useful evidence to support a strong working hypothesis as to whether it is likely to work for them as well.
2017-10-15

Determining the Impact of CREATE on Math and ELA Achievement

Empirical Education is conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) under an Investing in Innovation (i3) development grant awarded in 2014. The CREATE evaluation takes place in schools throughout the state of Georgia.

Approximately 40 residents from the Georgia State University (GSU) College of Education (COE) are participating in the CREATE teacher residency program. Using a quasi-experimental design, outcomes for these teachers and their students will be compared to those from a matched comparison group of close to 100 teachers who simultaneously enrolled in GSU COE but did not participate in CREATE. Implementation for cohort 1 started in 2015, and cohort 2 started in 2016. Confirmatory outcomes will be assessed in years 2 and 3 of both cohorts (2017 - 2019).

Confirmatory research questions we will be answering include:

What is the impact of one year of exposure of students to a novice teacher in their second year of teacher residency in the CREATE program, compared to the business-as-usual GSU teacher credential program, on mathematics and ELA achievement of students in grades 4-8, as measured by the Georgia Milestones Assessment System?

What is the impact of CREATE on the quality of instructional strategies used by teachers, as measured by the Teacher Assessment of Performance Standards (TAPS) scores, at the end of the third year of residency, relative to the business-as-usual condition?

What is the impact of CREATE on the quality of the learning environment created by teachers, as measured by Teacher Assessment of Performance Standards (TAPS) scores, at the end of the third year of residency, relative to the business-as-usual condition?

Exploratory research questions will address additional teacher-level outcomes including retention, effectiveness, satisfaction, collaboration, and levels of stress in relationships with students and colleagues.

We plan to publish the results of this study in fall of 2019. Please check back to read the research summary and report.

2017-06-06

IES Releases New Empirical Education Report on Educator Effectiveness

Our report just released by IES examines the statistical properties of Arizona’s new multiple-measure teacher evaluation model. The study used data from the pilot in 2012-13 to explore the relationships among the system’s component measures (teacher observations, student academic progress, and stakeholder surveys). It also investigated how well the model differentiated between higher and lower performing teachers. Findings suggest that the model’s observation measure may be improved through further evaluator training and calibration, and that a single aggregated composite score may not adequately represent independent aspects of teacher performance.

The study was carried out in partnership with the Arizona Department of Education as part of our work with the Regional Educational Laboratory (REL) West’s Educator Effectiveness Alliance, which includes Arizona, Utah, and Nevada Department of Education officials, as well as teacher union representatives, district administrators, and policymakers. While the analysis is specific to Arizona’s model, the study findings and methodology may be of interest to other state education agencies that are developing or implementing new multiple-measure evaluation systems. We have continued this work with additional analyses for alliance members and plan to provide additional capacity building during 2015.

2014-12-16

U.S. Department of Education Could Expand its Concept of Student Growth

The continuing debate about the use of student test scores as a part of teacher evaluation misses an essential point. A teacher’s influence on a student’s achievement does not end in spring when the student takes the state test (or is evaluated using any of the Student Learning Objectives methods). An inspiring teacher, or one that makes a student feel recognized, or one that digs a bit deeper into the subject matter, may be part of the reason that the student later graduates high school, gets into college, or pursues a STEM career. These are “student achievements,” but they are ones that show up years after a teacher had the student in her class. As a teacher is getting students to grapple with a new concept, the students may not demonstrate improvements on standardized tests that year. But the “value-added” by the teacher may show up in later years.

States and districts implementing educator evaluations as part of their NCLB waivers are very aware of the requirement that they must “use multiple valid measures in determining performance levels, including as a significant factor data on student growth …” Student growth is defined as change between points in time in achievement on assessments. Student growth defined in this way obscures a teacher’s contribution to a student’s later school career.

As a practical matter, it may seem obvious that for this year’s evaluation, we can’t use something that happens next year. But recent analyses of longitudinal data, reviewed in an excellent piece by Raudenbush, show that it is possible to identify predictors of later student achievement associated with individual teacher practices and effectiveness. The widespread implementation of multiple-measure teacher evaluations is starting to accumulate just the longitudinal datasets needed to do these predictive analyses. On the basis of these analyses we may be able to validate many of the facets of teaching that we have found, in analyses of the MET data, to be unrelated to student growth as defined in the waiver requirements.

Insofar as we can identify, through classroom observations and surveys, practices and dispositions that are predictive of later student achievement such as college going, then we have validated those practices. Ultimately, we may be able to substitute classroom observations and surveys of students, peers, and parents for value-added modeling based on state tests and other ad hoc measures of student growth. We are not yet at that point, but the first step will be to recognize that a teacher’s influence on a student’s growth extends beyond the year she has the student in the class.

2014-08-30

Factor Analysis Shows Facets of Teaching

The Empirical team has illustrated quantitatively what a lot of people have suspected. Basic classroom management, keeping things moving along, and the sense that the teacher is in control are most closely associated with achievement gains. We used teacher evaluation data collected by the Measures of Effective Teaching project to develop a three-factor model and found that only one factor was associated with VAM scores. Two other factors—one associated with constructivist pedagogy and the other with positive attitudes—were unrelated to short-term student outcomes. Val Lazarev and Denis Newman presented this work at the Association for Education Finance and Policy Annual Conference on March 13, 2014. And on May 7, Denis Newman and Kristen Koue conducted a workshop on the topic at the CCSSO’s SCEE Summit. The workshop emphasized the way that factors not directly associated with spring test scores can be very important in personnel decisions. The validation of these other factors may require connections to student achievements such as staying in school, getting into college, or pursuing a STEM career in years after the teacher’s direct contact.

2014-05-09

Does 1 teacher = 1 number? Some Questions About the Research on Composite Measures of Teacher Effectiveness

We are all familiar with approaches to combining student growth metrics and other measures to generate a single measure that can be used to rate teachers for the purpose of personnel decisions. For example, as an alternative to using seniority as the basis for reducing the workforce, a school system may want to base such decisions—at least in part—on a ranking based on a number of measures of teacher effectiveness. One of the reports released January 8 by the Measures of Effective Teaching (MET) addressed approaches to creating a composite (i.e., a single number that averages various aspects of teacher performance) from multiple measures such as value-added modeling (VAM) scores, student surveys, and classroom observations. Working with the thousands of data points in the MET longitudinal database, the researchers were able to try out multiple statistical approaches to combining measures. The important recommendation from this research for practitioners is that, while there is no single best way to weight the various measures that are combined in the composite, balancing the weights more evenly tends to increase reliability.
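
For readers who want to see what "weighting the measures" means in practice, here is a minimal sketch of one way to build a composite: standardize each measure, then take a weighted average. This is our illustration, not the MET researchers' procedure. The teacher scores, the measure names, and the two weighting schemes are all invented.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a measure so different scales can be combined."""
    m, s = mean(values), pstdev(values)  # assumes the values are not all equal
    return [(v - m) / s for v in values]

def composite(measures, weights):
    """Weighted average of standardized measures, one score per teacher."""
    total = sum(weights.values())
    standardized = {name: z_scores(vals) for name, vals in measures.items()}
    n_teachers = len(next(iter(measures.values())))
    return [
        sum(weights[name] / total * standardized[name][i] for name in measures)
        for i in range(n_teachers)
    ]

def ranking(scores):
    """Teacher indices ordered from highest to lowest composite score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Invented scores for four teachers; names and scales are hypothetical.
measures = {
    "vam": [0.10, -0.20, 0.30, 0.00],
    "student_survey": [3.8, 4.1, 3.2, 3.9],
    "observation": [2.9, 3.4, 2.6, 3.1],
}

vam_heavy = composite(measures, {"vam": 0.8, "student_survey": 0.1, "observation": 0.1})
balanced = composite(measures, {"vam": 1, "student_survey": 1, "observation": 1})

print("VAM-heavy ranking:", ranking(vam_heavy))
print("balanced ranking: ", ranking(balanced))
```

Comparing the two rankings shows how the choice of weights can move teachers up or down, even though the underlying measures are unchanged. Judging which weighting is more reliable would require repeated measurements, which this sketch does not attempt.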

While acknowledging the value of these analyses, we want to take a step back in this commentary. Here we ask whether agencies may sometimes be jumping to the conclusion that a composite is necessary when the individual measures (and even the components of these measures) may have greater utility than the composite for many purposes.

The basic premise behind creating a composite measure is the idea that there is an underlying characteristic that the composite can more or less accurately reflect. The criterion for a good composite is the extent to which the result accurately identifies a stable characteristic of the teacher’s effectiveness.

A problem with this basic premise is that in focusing on the common factor, the aspects of each measure that are unrelated to the common factor get left out—treated as noise in the statistical equation. But, what if observations and student surveys measure things that are unrelated to what the teacher’s students are able to achieve in a single year under her tutelage (the basis for a VAM score)? What if there are distinct domains of teacher expertise that have little relation to VAM scores? By definition, the multifaceted nature of teaching gets reduced to a single value in the composite.

This single value does have a use in decisions that require an unequivocal ranking of teachers, such as some personnel decisions. For most purposes, however, a multifaceted set of measures would be more useful. The single measure has little value for directing professional development, whereas the detailed output of the observation protocols is designed for just that. Consider a principal deciding which teachers to assign as mentors, or a district administrator deciding which teachers to move toward a principalship. Might it be useful, in such cases, to have several characteristics to represent different dimensions of abilities relevant to success in the particular roles?

Instead of collapsing the multitude of data points from achievement, surveys, and observations, consider an approach that makes maximum use of the data points to identify several distinct characteristics. In the usual method for constructing a composite (and in the MET research), the results for each measure (e.g., the survey or observation protocol) are first collapsed into a single number, and then these values are combined into the composite. This approach already obscures a large amount of information. The Tripod student survey provides scores on the seven Cs; an observation framework may have a dozen characteristics; and even VAM scores, usually thought of as a summary number, can be broken down (with some statistical limitations) into success with low-scoring vs. with high-scoring students (or any other demographic category of interest). Analyzing dozens of these data points for each teacher can potentially identify several distinct facets of a teacher’s overall ability. Not all facets will be strongly correlated with VAM scores but may be related to the teacher’s ability to inspire students in subsequent years to take more challenging courses, stay in school, and engage parents in ways that show up years later.

Creating a single composite measure of teaching has value for a range of administrative decisions. However, the mass of teacher data now being collected is only beginning to be tapped for improving teaching and developing schools as learning organizations.

2013-02-14

Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation—systematic classroom observations by administrators—is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent OpEd in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching, such as value-added modeling (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see our earlier story on the Validation Engine). But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the system’s local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.
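
As a rough sketch of the first suggestion, the snippet below correlates each rubric component with teachers' VAM scores using a plain Pearson correlation. The rubric dimension names and all of the numbers are hypothetical. A real analysis would use a district's own data, far more teachers, and attention to measurement error.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # assumes neither list is constant

# Invented ratings for six teachers on two hypothetical rubric dimensions.
rubric = {
    "classroom_management": [3, 2, 4, 3, 1, 4],
    "questioning": [2, 3, 3, 4, 2, 3],
}
# Invented VAM scores for the same six teachers, in the same order.
vam = [0.2, -0.1, 0.4, 0.1, -0.3, 0.3]

correlations = {dim: pearson(scores, vam) for dim, scores in rubric.items()}

for dim, r in correlations.items():
    print(f"{dim}: r = {r:.2f}")
```

A component with a correlation near zero is not necessarily unimportant. As argued elsewhere on this page, it may capture aspects of teaching that show up in outcomes other than the current year's test scores.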

2012-10-31

Need for Product Evaluations Continues to Grow

There is a growing need for evidence of the effectiveness of products and services being sold to schools. A new release of SIIA’s product evaluation guidelines is now available at the Selling to Schools website (with continued free access to SIIA members), to help guide publishers in measuring the effectiveness of the tools they are selling to schools.

It’s been almost a decade since NCLB made its call for “scientifically-based research,” but the calls for research haven’t faded away. This is because resources available to schools have diminished over that time, heightening the importance of cost-benefit trade-offs in spending.

NCLB has focused attention on test score achievement, and this metric is becoming more pervasive; e.g., through a tie to teacher evaluation and through linkages to dropout risk. While NCLB fostered a compliance mentality—product specs had to have a check mark next to SBR—the need to assure that funds are not wasted is now leading to a greater interest in research results. Decision-makers are now very interested in whether specific products will be effective, or how well they have been working, in their districts.

Fortunately, the data available for evaluations of all kinds is getting better and easier to access. The U.S. Department of Education has poured hundreds of millions of dollars into state data systems. These investments make data available to states and drive the cleaning and standardizing of data from districts. At the same time, districts continue to invest in data systems and warehouses. While it is still not a trivial task, it has become much easier over the last decade for school district researchers to get the data needed to determine whether an investment paid off in terms of increased student achievement or attendance.

The reauthorization of ESEA (i.e., NCLB) is maintaining the pressure to evaluate education products. We are still a long way from the draft reauthorization introduced in Congress becoming a law, but the initial indications are quite favorable to the continued production of product effectiveness evidence. The language has changed somewhat. Look for the phrase “evidence based”. Along with the term “scientifically-valid”, this new language is actually more sophisticated and potentially more effective than the old SBR neologism. Bob Slavin, one of the reviewers of the SIIA guidelines, says in his Ed Week blog that “This is not the squishy ‘based on scientifically-based evidence’ of NCLB. This is the real McCoy.” It is notable that the definition of “evidence-based” goes beyond just setting rules for the design of research, such as the SBR focus on the single dimension of “internal validity” for which randomization gets the top rating. It now asks how generalizable the research is or its “external validity”; i.e., does it have any relevance for decision-makers?

One of the important goals of the SIIA guidelines for product effectiveness research is to improve the credibility of publisher-sponsored research. It is important that educators see it as more than just “market research” producing biased results. In this era of reduced budgets, schools need to have tangible evidence of the value of products they buy. By following the SIIA’s guidelines, publishers will find it easier to achieve that credibility.

2011-11-12

Empirical Education Develops Web-Based Tool to Improve Teacher Evaluation

For school districts looking for ways to improve teacher observation methods, Empirical Education has begun development of a web-delivered tool that will provide a convenient way to validate their observational protocols and rubrics against measures of the teacher’s contribution to student academic growth.

Empirical Education is charged with developing a “validation engine” as part of the Measures of Effective Teaching (MET) project, funded by the Bill and Melinda Gates Foundation. As described on the project’s website, the tool will allow users to “view classroom observation videos, rate those videos and then receive a report that evaluates the predictive validity and rater consistency for the protocol.” The MET project has collected thousands of hours of video of classrooms as well as records of the characteristics and academic performance associated with the students in the class.

By watching and coding videos of a range of teachers, users will be able to verify whether or not their current teacher rating systems are identifying teaching behavior associated with higher achievement. The tool will allow users to review their own rating systems against a variety of MET project measures, and will give real-time feedback through an automated report generator.

Development of the validation engine builds on two years of MET Project research, which included data from six school districts across the country and over 3,000 teachers. Researchers will now use the data to identify leading indicators of teacher practice on student achievement. The engine is expected to undergo beta testing over the next few months, beginning with the National Math and Science Initiative.

The announcement of the new tool comes as alternative ways to measure teacher effectiveness are becoming a major issue in education and as federal, state, and local officials and teacher organizations look for research-based ways to identify effective teachers and improve student outcomes.

“At a time when schools are experiencing budget cuts, it is vital that school districts have ready access to research tools, so that they can make the most informed decisions,” says Denis Newman, President of Empirical Education. The validation engine will be part of a suite of web-based technology tools developed by the company, including MeasureResults, an online tool that allows districts to evaluate the effectiveness of the products and programs they use.

2010-11-17