Blog Posts and News Stories

Empirical’s Impact as a Service Providing Insight to EdTech Companies

Education innovators and entrepreneurs have been receiving a boost of support from private equity investors. Currently, ASU GSV is holding its 2016 Summit to support new businesses whose goals are to make a difference in education. Reach Newschools Capital (Reach) is one such organization, providing early-stage funding, as well as business acumen, to entrepreneurs trying to solve the most challenging issues, often with the most challenged populations, in K-12 education. Reach has engaged Empirical Education to provide research services examining the demographics of the constituents these education innovators hope to serve. Using company data from 20 of Reach’s portfolio companies, Empirical provides reports and easy-to-read graphs comparing customer demographic information to national average estimates.

The reports have been well received, providing the kind of information companies need to stay on mission: economically, through their goods and services, and in their social impact.

“The EdTech industry is trying to change the perception that the latest and greatest technologies are only reaching the wealthiest students with the most resources. These reports are disproving this claim, showing that a large number of low-income, minority students are utilizing these products,” said Aly Sharp, Product Manager at Empirical Education.


Understanding Logic Models Workshop Series

On July 17, Empirical Education facilitated the first of two workshops for practitioners in New Mexico on the development of program logic models, one of the first steps in developing a research agenda. The workshop, entitled “Identifying Essential Logic Model Components, Definitions, and Formats”, introduced the general concepts, purposes, and uses of program logic models to members of the Regional Education Lab (REL) Southwest’s New Mexico Achievement Gap Research Alliance. Throughout the workshop, participants collaborated with facilitators to build a logic model for a program or policy that participants are working on or that is of interest.

Empirical Education is part of the REL Southwest team, which assists Arkansas, Louisiana, New Mexico, Oklahoma, and Texas in using data and research evidence to address high-priority regional needs, including charter school effectiveness, early childhood education, Hispanic achievement in STEM, rural school performance, and closing the achievement gap, through six research alliances. The logic model workshops aim to strengthen the technical capacity of New Mexico Achievement Gap Research Alliance members to understand and visually represent their programs’ theories of change, identify key program components and outcomes, and use logic models to develop research questions. Both workshops are being held in Albuquerque, New Mexico.


Study Shows a “Singapore Math” Curriculum Can Improve Student Problem Solving Skills

A study of HMH Math in Focus (MIF) released today by research firm Empirical Education Inc. demonstrates a positive impact of the curriculum on Clark County School District elementary students’ math problem-solving skills. The 2011-2012 study was contracted by the publisher, which left the design, conduct, and reporting to Empirical. MIF provides elementary math instruction based on the pedagogical approach used in Singapore. The MIF approach to instruction is designed to support conceptual understanding and is said to be closely aligned with the Common Core State Standards (CCSS), which focus more on in-depth learning than previous math standards.

Empirical found an increase in math problem solving among students taught with HMH Math in Focus compared to their peers. The Clark County School District teachers also reported an increase in their students’ conceptual understanding, as well as an increase in student confidence and engagement while explaining and solving math problems. The study addressed the difference between the CCSS-oriented MIF and the existing Nevada math standards and content. While MIF students performed comparatively better on complex problem solving skills, researchers found that students in the MIF group performed no better than the students in the control group on the measure of math procedures and computation skills. There was also no significant difference between the groups on the state CRT assessment, which has not fully shifted over to the CCSS.

The research used a group-randomized controlled trial to examine the performance of students in grades 3-5 during the 2011-2012 school year. Each grade-level team was randomly assigned either to the treatment group, which used MIF, or to the control group, which used the conventional math curriculum. Researchers used three different assessments to capture math achievement, contrasting procedural and problem-solving skills. Additionally, the research design employed teacher survey data to conduct mediator analyses (correlations between the percentage of math standards covered and student math achievement) and to assess fidelity of classroom implementation.
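For readers unfamiliar with cluster (group) randomization, the assignment step described above can be sketched in a few lines: whole grade-level teams, not individual teachers or students, are randomized to condition. The team names and seed below are hypothetical, purely for illustration.

```python
# Sketch of cluster random assignment: grade-level teams are the unit of
# randomization, split evenly between treatment (MIF) and control.
import random

# Hypothetical grade-level teams (the actual study's units differed in number).
teams = ["School A gr3", "School A gr4", "School A gr5",
         "School B gr3", "School B gr4", "School B gr5"]

rng = random.Random(2011)  # fixed seed so the assignment is reproducible
shuffled = teams[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
assignment = {team: ("treatment" if i < half else "control")
              for i, team in enumerate(shuffled)}

for team in teams:
    print(f"{team}: {assignment[team]}")
```

Because every student on a team shares a condition, analyses of such designs must account for the clustering rather than treating students as independently assigned.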

You can download the report and research summary from the study using the links below.
Math in Focus research report
Math in Focus research summary


Can We Measure the Measures of Teaching Effectiveness?

Teacher evaluation has become the hot topic in education. State and local agencies are quickly implementing new programs spurred by federal initiatives and evidence that teacher effectiveness is a major contributor to student growth. The Chicago teachers’ strike brought out the deep divisions over the issue of evaluations. There, the focus was on the use of student achievement gains, or value-added. But the other side of evaluation, systematic classroom observations by administrators, is also raising interest. Teaching is a very complex skill, and the development of frameworks for describing and measuring its interlocking elements is an area of active and pressing research. The movement toward using observations as part of teacher evaluation is not without controversy. A recent op-ed in Education Week by Mike Schmoker criticizes the rapid implementation of what he considers overly complex evaluation templates “without any solid evidence that it promotes better teaching.”

There are researchers engaged in the careful study of evaluation systems, including the combination of value-added and observations. The Bill and Melinda Gates Foundation has funded a large team of researchers through its Measures of Effective Teaching (MET) project, which has already produced an array of reports for both academic and practitioner audiences (with more to come). But research can be ponderous, especially when the question is whether such systems can impact teacher effectiveness. A year ago, the Institute of Education Sciences (IES) awarded an $18 million contract to AIR to conduct a randomized experiment to measure the impact of a teacher and leader evaluation system on student achievement, classroom practices, and teacher and principal mobility. The experiment is scheduled to start this school year and results will likely start appearing by 2015. However, at the current rate of implementation by education agencies, most programs will be in full swing by then.

Empirical Education is currently involved in teacher evaluation through Observation Engine: our web-based tool that helps administrators make more reliable observations. See our story about our work with Tulsa Public Schools. This tool, along with our R&D on protocol validation, was initiated as part of the MET project. In our view, the complexity and time-consuming aspects of many of the observation systems that Schmoker criticizes arise from their intended use as supports for professional development. The initial motivation for developing observation frameworks was to provide better feedback and professional development for teachers. Their complexity is driven by the goal of providing detailed, specific feedback. Such systems can become cumbersome when applied to the goal of providing a single score for every teacher representing teaching quality that can be used administratively, for example, for personnel decisions. We suspect that a more streamlined and less labor-intensive evaluation approach could be used to identify the teachers in need of coaching and professional development. That subset of teachers would then receive the more resource-intensive evaluation and training services such as complex, detailed scales, interviews, and coaching sessions.

The other question Schmoker raises is: do these evaluation systems promote better teaching? While waiting for the IES study to be reported, some things can be done. First, look at correlations of the components of the observation rubrics with other measures of teaching, such as value-added to student achievement (VAM) scores or student surveys. The idea is to see whether the behaviors valued and promoted by the rubrics are associated with improved achievement. The videos and data collected by the MET project are the basis for tools to do this (see our earlier story on the Validation Engine). But school systems can conduct the same analysis using their own student and teacher data. Second, use quasi-experimental methods to look at the changes in achievement related to the local implementation of evaluation systems. In both cases, many school systems are already collecting very detailed data that can be used to test the validity and effectiveness of their locally adopted approaches.
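The first suggestion, correlating rubric components with VAM scores, requires nothing more than teacher-level pairs of scores. A minimal sketch, with an entirely hypothetical rubric component and invented data:

```python
# Illustrative sketch: correlate one observation-rubric component with
# value-added (VAM) scores for the same teachers. The component name and
# all values are hypothetical; a district would substitute its own records.

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical teacher-level data: rubric component score vs. VAM score.
rubric_questioning = [2.1, 3.4, 2.8, 3.9, 1.7, 3.1]
vam_score = [-0.2, 0.4, 0.1, 0.6, -0.5, 0.2]

r = pearson_r(rubric_questioning, vam_score)
print(f"correlation of 'questioning' component with VAM: {r:.2f}")
```

A positive correlation would suggest the behaviors the rubric rewards track achievement gains; a near-zero one would be a prompt to examine that component's validity.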


The Value of Looking at Local Results

The report we released today has an interesting history that shows the value of looking beyond the initial results of an experiment. Later this week, we are presenting a paper at AERA entitled “In School Settings, Are All RCTs Exploratory?” The findings we report from our experiment with an iPad application were part of the inspiration for that paper. If Riverside Unified had not looked at its own data, we would not, in the normal course of data analysis, have broken the results out by individual district, and our conclusion would have been that there was no discernible impact of the app. We can cite many other cases where looking at subgroups leads us to conclusions different from the conclusion based on the result averaged across the whole sample. Our report on AMSTI is another case we will cite in our AERA paper.

We agree with the Institute of Education Sciences (IES) in taking a disciplined approach in requiring that researchers “call their shots” by naming the small number of outcomes considered most important in any experiment. All other questions are fine to look at but fall into the category of exploratory work. What we want to guard against, however, is the implication that answers to primary questions, which often are concerned with average impacts for the study sample as a whole, must apply to various subgroups within the sample, and therefore can be broadly generalized by practitioners, developers, and policy makers.

If we find an average impact but in exploratory analysis discover plausible, policy-relevant, and statistically strong differential effects for subgroups, then doubt is cast on the completeness of the confirmatory finding. We may not be certain of a moderator effect, but once one comes to light, the average impact can be considered incomplete or misleading for practical purposes. If an additional experiment is necessary to verify a differential subgroup impact, the same experiment may verify that the average impact is not what practitioners, developers, and policy makers should be concerned with.

In our paper at AERA, we are proposing that any result from a school-based experiment should be treated as provisional by practitioners, developers, and policy makers. The results of RCTs can be very useful, but the challenges of generalizability of the results from even the most stringently designed experiment mean that the results should be considered the basis for a hypothesis that the intervention may work under similar conditions. For a developer considering how to improve an intervention, the specific conditions under which it appeared to work or not work is the critical information to have. For a school system decision maker, the most useful pieces of information are insight into subpopulations that appear to benefit and conditions that are favorable for implementation. For those concerned with educational policy, it is often the case that conditions and interventions change and develop more rapidly than research studies can be conducted. Using available evidence may mean digging through studies that have confirmatory results in contexts similar or different from their own and examining exploratory analyses that provide useful hints as to the most productive steps to take next. The practitioner in this case is in a similar position to the researcher considering the design of the next experiment. The practitioner also has to come to a hypothesis about how things work as the basis for action.


Exploration in the World of Experimental Evaluation

Our 300+ page report makes a good start. But IES, faced with limited time and resources to complete the many experiments being conducted within the Regional Education Lab system, put strict limits on the number of exploratory analyses researchers could conduct. We usually think of exploratory work as questions to follow up on puzzling or unanticipated results. However, in the case of the REL experiments, IES asked researchers to focus on a narrow set of “confirmatory” results and anything else was considered “exploratory,” even if the question was included in the original research design.

The strict IES criteria were based on the principle that when a researcher is using tests of statistical significance, the probability of erroneously concluding that there is an impact when there isn’t one increases with the frequency of the tests. In our evaluation of AMSTI, we limited ourselves to only four such “confirmatory” (i.e., not exploratory) tests of statistical significance. These were used to assess whether there was an effect on student outcomes for math problem-solving and for science, and the amount of time teachers spent on “active learning” practices in math and in science. (Technically, IES considered this two sets of two, since two were the primary student outcomes and two were the intermediate teacher outcomes.) The threshold for significance was made more stringent to keep the probability of falsely concluding that there was a difference for any of the outcomes at 5% (often expressed as p < .05).
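The adjustment logic described above can be sketched in a few lines. A Bonferroni-style division of the familywise alpha across tests is one common way to hold the overall false-positive probability at 5%; the study's actual procedure may have differed (IES treated the four tests as two sets of two), and the p-values below are invented purely for illustration.

```python
# Sketch of the multiple-comparisons logic: with four confirmatory
# significance tests, the per-test threshold is tightened so the
# familywise chance of a false positive stays at 5%.

alpha_family = 0.05
n_tests = 4

# Bonferroni-style adjustment: divide the familywise alpha across tests.
alpha_per_test = alpha_family / n_tests  # 0.0125

# Hypothetical p-values for the four confirmatory outcomes.
p_values = {
    "math problem solving (student)": 0.004,
    "science (student)": 0.030,
    "active learning in math (teacher)": 0.011,
    "active learning in science (teacher)": 0.200,
}

for outcome, p in p_values.items():
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"{outcome}: p = {p} -> {verdict} at adjusted alpha {alpha_per_test}")
```

Note how an outcome with p = .03, which would clear an unadjusted .05 threshold, fails the adjusted one; this is exactly the trade-off that motivates limiting the number of confirmatory tests.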

While the logic for limiting the number of confirmatory outcomes is based on technical arguments about adjustments for multiple comparisons, the limit on the amount of exploratory work was based more on resource constraints. Researchers are notorious (and we don’t exempt ourselves) for finding more questions in any study than were originally asked. Curiosity-based exploration can indeed go on forever. In the case of our evaluation of AMSTI, however, there were a number of fundamental policy questions that were not answered either by the confirmatory or by the exploratory questions in our report. More research is needed.

Take the confirmatory finding that the program resulted in the equivalent of 28 days of additional math instruction (or technically an impact of 5% of a standard deviation). This is a testament to the hard work and ingenuity of the AMSTI team and the commitment of the school systems. From a state policy perspective, it gives a green light to continuing the initiative’s organic growth. But since all the schools in the experiment applied to join AMSTI, we don’t know what would happen if AMSTI were adopted as the state curriculum requiring schools with less interest to implement it. Our results do not generalize to that situation. Likewise, if another state with different levels of achievement or resources were to consider adopting it, we would say that our study gives good reason to try it but, to quote Lee Cronbach, a methodologist whose ideas increasingly resonate as we translate research into practice: “…positive results obtained with a new procedure for early education in one community warrant another community trying it. But instead of trusting that those results generalize, the next community needs its own local evaluation” (Cronbach, 1975, p. 125).
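As a back-of-envelope illustration of how an effect size translates into "days of instruction": if typical annual growth at these grades is assumed to be about 0.32 standard deviations over a 180-day school year, an assumption chosen here only to reproduce the reported figure (the report's actual benchmark may differ), the arithmetic works out as follows.

```python
# Illustrative conversion of an effect size into "days of instruction."
# The annual-growth and school-year figures are assumptions for this sketch.

effect_size_sd = 0.05      # impact: 5% of a standard deviation
annual_growth_sd = 0.32    # assumed typical yearly growth, in SD units
school_year_days = 180     # assumed instructional days per year

extra_days = effect_size_sd / annual_growth_sd * school_year_days
print(f"equivalent additional instruction: {extra_days:.0f} days")
```

The conversion makes a small-looking standardized effect interpretable to policy audiences, but it is only as good as the growth benchmark behind it.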

The explorations we conducted as part of the AMSTI evaluation did not take the usual form of deeper examinations of interesting or unexpected findings uncovered during the planned evaluation. All the reported explorations were questions posed in the original study plan. They were defined as exploratory either because they were considered of secondary interest, such as the outcome for reading, or because they were not a direct causal result of the randomization, such as the results for subgroups of students defined by different demographic categories. Nevertheless, exploration of such differences is important for understanding how and for whom AMSTI works. The overall effect, averaging across subgroups, may mask differences that are of critical importance for policy.

Readers interested in the issue of subgroup differences can refer to Table 6.11. Once differences are found in groups defined in terms of individual student characteristics, our real exploration is just beginning. For example, can the difference be accounted for by other characteristics or combinations of characteristics? Is there something that differentiates the classes or schools that different students attend? Such questions begin to probe additional factors that can potentially be addressed in the program or its implementation. In any case, the report just released is not the “final report.” There is still a lot of work necessary to understand how any program of this sort can continue to be improved.


Research Guidelines Re-released to Broader Audience

The updated guidelines for evaluation research were unveiled at the SIIA Ed Tech Business Forum, held in New York City on November 28 - 29. Authored by Empirical’s CEO, Denis Newman, and issued by the Software and Information Industry Association (SIIA), the guidelines seek to provide a standard of best practices for conducting and reporting evaluation studies of educational technologies, in order to enhance their quality, credibility, and utility to education decision makers.

Denis introduced the guidelines during the “Meet the authors of SIIA Publications” session on November 29. Non-members will be able to purchase the guidelines from Selling to Schools starting Thursday, December 1, 2011 (with continued free access to SIIA members). UPDATE: Denis was interviewed by Glen McCandless of Selling to Schools on December 15, 2011 to discuss key aspects of the guidelines. Listen to the full interview here.


Join Empirical Education at ALAS, AEA, and NSDC

This year, the Association of Latino Administrators & Superintendents (ALAS) will be holding its 8th annual summit on Hispanic Education in San Francisco. Participants will have the opportunity to attend speaker sessions, roundtable discussions, and network with fellow attendees. Denis Newman, CEO of Empirical Education, together with John Sipe, Senior Vice President and National Sales Manager at Houghton Mifflin Harcourt, and Jeannetta Mitchell, eighth-grade teacher at Presidio Middle School and a participant in the pilot study, will take part in a 30-minute discussion reviewing the study design and experiences gathered from a one-year study of Algebra on the iPad. The session takes place on October 13th in Salon 8 of the Marriott Marquis in San Francisco, from 10:30am to 12:00pm.

Also this year, the American Evaluation Association (AEA) will be hosting its 25th annual conference from November 2–5 in Anaheim, CA. Approximately 2,500 evaluation practitioners, academics, and students from around the globe are expected to gather at the conference. This year’s theme revolves around the challenges of values and valuing in evaluation.

We are excited to be part of AEA again this year and would like to invite you to join us at two presentations. First, Denis Newman will be hosting the roundtable session “Returning to the Causal Explanatory Tradition: Lessons for Increasing the External Validity of Results from Randomized Trials.” We examine how the causal explanatory tradition, originating in the writing of Lee Cronbach, can inform the planning, conduct, and analysis of randomized trials to increase the external validity of findings. Find us in the Balboa A/B room on Friday, November 4th from 10:45am to 11:30am.

Second, Valeriy Lazarev and Denis Newman will present a paper entitled “From Program Effect to Cost Savings: Valuing the Benefits of Educational Innovation Using Vertically Scaled Test Scores and Instructional Expenditure Data.” Be sure to stop by on Saturday, November 5th from 9:50am to 11:20am in room Avila A.

Furthermore, Jenna Zacamy, Senior Research Manager at Empirical Education, will be presenting on two topics at the National Staff Development Council (NSDC) annual conference taking place in Anaheim, CA from December 3rd to 7th. Join her on Monday, December 5th from 2:30pm to 4:30pm, when she will talk about the impact of the Alabama Math, Science, and Technology Initiative on student achievement in grades 4 through 8, together with Pamela Finney and Jean Scott from the SERVE Center at UNCG.

On Tuesday, December 6th from 10:00am to 12:00pm, Jenna will discuss prior and current research on the effectiveness of a large-scale high school literacy reform, together with Cathleen Kral from WestEd and William Loyd from Washtenaw Intermediate School District.


Comment on the NY Times: In Classroom of Future, Stagnant Scores

The New York Times is running a series of front-page articles on “Grading the Digital School.” The first one ran Labor Day weekend and raised the question as to whether there’s any evidence that would persuade a school board or community to allocate extra funds for technology. With the demise of the Enhancing Education Through Technology (EETT) program, federal funds dedicated to technology will no longer be flowing into states and districts. Technology will have to be measured against any other discretionary purchase. The resulting internal debates within schools and their communities about the expense vs. value of technology promise to have interesting implications and are worth following closely.

The first article by Matt Richtel revisits a debate that has been going on for decades between those who see technology as the key to “21st Century learning” and those who point to the dearth of evidence that technology makes any measurable difference to learning. It’s time to try to reframe this discussion in terms of what can be measured. And in considering what to measure, and in honor of Labor Day, we raise a question that is often ignored: what role do teachers play in generating the measurable value of technology?

Let’s start with the most common argument in favor of technology, even in the absence of test score gains. The idea is that technology teaches skills “needed in a modern economy,” and these are not measured by the test scores used by state and federal accountability systems. Karen Cator, director of the U.S. Department of Education Office of Educational Technology, is quoted as saying (in reference to the lack of improvement in test scores), “…look at all the other things students are doing: learning to use the Internet to research, learning to organize their work, learning to use professional writing tools, learning to collaborate with others.” Presumably, none of these things directly impact test scores. The problem with this perennial argument is that many other things that schools keep track of should provide indicators of improvement. If, as a result of technology, students are more excited about learning or more engaged in collaborating, we could look for an improvement in attendance, a decrease in drop-outs, or students signing up for more challenging courses.

Information on student behavioral indicators is becoming easier to obtain since the standardization of state data systems. There are some basic study designs that use comparisons among students within the district or between those in the district and those elsewhere in the state. This approach uses statistical modeling to identify trends and control for demographic differences, but is not beyond the capabilities of many school district research departments[1] or the resources available to the technology vendors. (Empirical has conducted research for many of the major technology providers, often focusing on results for a single district interested in obtaining evidence to support local decisions.) Using behavioral or other indicators, a district such as that in the Times article can answer its own questions. Data from the technology systems themselves can be used to identify users and non-users and to confirm the extent of usage and implementation. It is also valuable to examine whether some students (those in most need or those already doing okay) or some teachers (veterans or novices) receive greater benefit from the technology. This information may help the district focus resources where they do the most good.
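A district research department could sketch the simplest version of such a demographically adjusted comparison as a stratified difference in means: compare users with non-users within each demographic group, then average the within-group gaps. The records, group labels, and outcome scale below are all hypothetical.

```python
# Minimal sketch of a within-district comparison of technology users vs.
# non-users, stratified by a demographic category so that groups are
# compared like-with-like. All records here are invented for illustration;
# a real analysis would use richer models and many more covariates.
from collections import defaultdict

# (demographic group, used_technology, outcome score) -- hypothetical data
records = [
    ("low_income", True, 72), ("low_income", True, 75),
    ("low_income", False, 68), ("low_income", False, 70),
    ("higher_income", True, 85), ("higher_income", True, 83),
    ("higher_income", False, 82), ("higher_income", False, 80),
]

def stratified_difference(records):
    """Average user-minus-nonuser gap, computed within each stratum."""
    by_stratum = defaultdict(lambda: {True: [], False: []})
    for group, used, score in records:
        by_stratum[group][used].append(score)
    gaps = []
    for scores in by_stratum.values():
        gap = (sum(scores[True]) / len(scores[True])
               - sum(scores[False]) / len(scores[False]))
        gaps.append(gap)
    return sum(gaps) / len(gaps)

diff = stratified_difference(records)
print(f"demographic-adjusted user advantage: {diff:.1f} points")
```

Stratifying first prevents a raw users-vs.-non-users comparison from being driven by which demographic groups happen to adopt the technology, which is the core of the "control for demographic differences" point above.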

A final thought about where to look for impacts of technologies comes from a graph of the school district’s budget. While spending on technology and salaries have both declined over the last three years, spending on salaries is still about 25 times as great as spending on technology. Any discussion of where to find an impact of technology must consider labor costs, which are the district’s primary investment. We might ask whether a small investment in technology would allow the district to reduce the number of teachers, for example, by allowing a small increase in the number of students each teacher can productively handle. Alternatively, we might ask whether technology can make a teacher more effective, by whatever measures of effective teaching the district chooses to use, with their current students. We might ask whether technologies result in keeping young teachers on the job longer or encourage them to take on more challenging assignments.

It may be a mistake to look for a direct impact of technology on test scores (aside from technologies aimed specifically at that goal), but it is also a mistake to assume the impact is, in principle, not measurable. We need a clear picture of how various technologies are expected to work and where we can look for the direct and indirect effects. An important role of technology in the modern economy is providing people with actionable evidence. It would be ironic if education technology were inherently opaque to educational decision makers.

[1] Or, we would hope, the New York Times. Sadly, the article provides a graph of trends in math and reading for the district highlighted in the story compared to trends for the state. The graphic is meant to show that the district is doing worse than the state average. But the article never suggests that we should consider the population of the particular district and whether it is doing better or worse than one would expect, controlling for demographics, available resources, and other characteristics.


A Conversation About Building State and Local Research Capacity

John Q. Easton, director of the Institute of Education Sciences (IES), came to New Orleans recently to participate in the annual meeting of the American Educational Research Association. At one of his stops, he was the featured speaker at a meeting of the Directors of Research and Evaluation (DRE), an organization composed of school district research directors. (DRE is affiliated with AERA and was recently incorporated as a 501(c)(3).) John started his remarks by pointing out that for much of his career he was a school district research director and felt great affinity with the group. He introduced the directions that IES was taking, especially how it was approaching working with school systems. He spent most of the hour fielding questions and engaging in discussion with the participants. Several interesting points came out of the conversation about roles for the researchers who work for education agencies.

Historically, most IES research grant programs have been aimed at university or other academic researchers. It is noteworthy that even in a program for “Evaluation of State and Local Education Programs and Policies,” grants have been awarded only to universities and large research firms. There is no expectation that researchers working for the state or local agency would be involved in the research beyond the implementation of the program. The RFP for the next generation of Regional Education Labs (REL) contracts may help to change that. The new RFP expects the RELs to work closely with education agencies to define their research questions and to assist alliances of state and local agencies in developing their own research capacity.

Members of the audience noted that, as district directors of research, they often spend more time reviewing research proposals from students and professors at local colleges who want to conduct research in their schools than actually answering questions initiated by the district. Funded researchers treat the districts as the “human subjects,” paying incentives to participants and sometimes paying for data services. But the districts seldom participate in defining the research topic, conducting the studies, or benefiting directly from the reported findings. The new mission of the RELs to build local capacity will be a major shift.

Some in the audience pointed out reasons to be skeptical that this REL agenda would be possible. How can we build capacity if research and evaluation departments across the country are being cut? In fact, very little is known about the number of state or district practitioners whose capacity for research and evaluation could be built by applying the REL resources. (Perhaps a good first research task for the RELs would be to conduct a national survey to measure the existing capacity.)

John made a good point in reply: IES and the RELs have to work with the district leadership—not just the R&E departments—to make this work. The leadership has to have a more analytic view. They need to see the value of having an R&E department that goes beyond test administration, and is able to obtain evidence to support local decisions. By cultivating a research culture in the district, evaluation could be routinely built in to new program implementations from the beginning. The value of the research would be demonstrated in the improvements resulting from informed decisions. Without a district leadership team that values research to find out what works for the district, internal R&E departments will not be seen as an important capacity.

Some in the audience pointed out that in parallel to building a research culture in districts, it will be necessary to build a practitioner culture among researchers. It would be straightforward for IES to require that research grantees and contractors engage the district R&E staff in the actual work, not just review the research plan and sign the FERPA agreement. Practitioners ultimately hold the expertise in how the programs and research can be implemented successfully in the district, thus improving the overall quality and relevance of the research.