blog posts and news stories

Classrooms and Districts: Breaking Down Silos in Education Research and Evidence

I just got back from Edsurge’s Fusion conference. The theme, aimed at classroom and school leaders, was personalizing classroom instruction. This is guided by learning science, which includes brain development and the impact of trauma, as well as empathetic caregiving, as Pamela Cantor beautifully explained in her keynote. It also leads to detailed characterizations of learner variability being explored at Digital Promise by Vic Vuchic’s team, which is providing teachers with mappings between classroom goals and tools and strategies that can address learners who vary in background, cognitive skills, and socio-emotional character.

One of the conference tracks that particularly interested me was the workshops and discussions under “Research & Evidence”. Here is where I experienced a disconnect between Empirical ’s research policy-oriented work interpreting ESSA and Fusion’s focus on improving the classroom.

  • The Fusion conference is focused at the classroom level, where teachers along with their coaches and school leaders are making decisions about personalizing the instruction to students. They advocate basing decisions on research and evidence from the learning sciences.
  • Our work, also using research and evidence, has been focused on the school district level where decisions are about procurement and implementation of educational materials including the technical infrastructure needed, for example, for edtech products.

While the classroom and district levels have different needs and resources and look to different areas of scientific expertise, they need not form conceptual silos. But the differences need to be understood.

Consider the different ways we look at piloting a new product.

  • The Digital Promise edtech pilot framework attempts to move schools toward a more planful approach by getting them to identify and quantify the problem for which the product being piloted could be a solution. The success in the pilot classrooms is evaluated by the teachers, where detailed understandings by the teacher don’t call for statistical comparisons. Their framework points to tools such as the RCE Coach that can help with the statistics to support local decisions.
  • Our work looks at pilots differently. Pilots are excellent for understanding implementability and classroom acceptance (and working with developers to improve the product), but even with rapid cycle tools, the quantitative outcomes are usually not available in time for local decisions. We are more interested in how data can be accumulated nationally from thousands of pilots so that teachers and administrators can get information on which products are likely to work in their classrooms given their local demographics and resources. This is where review sites like Edsurge product reviews or Noodle’s ProcureK12 could be enhanced with evidence about for whom, and under what conditions, the products work best. With over 5,000 edtech products, an initial filter to help choose what a school should pilot will be necessary.

A framework that puts these two approaches together is promulgated in the Every Student Succeeds Act (ESSA). ESSA defines four levels of evidence, based on the strength of the causal inference about whether the product works. More than just a system for rating the scientific rigor of a study, it is a guide to developing a research program with a basis in learning science. The base level says that the program must have a rationale. This brings us back to the Digital Promise edtech pilot framework needing teachers to define their problem. The ESSA level 1 rationale is what the pilot framework calls for. Schools must start thinking through what the problem is that needs to be solved and why a particular product is likely to be a solution. This base level sets up the communication between educators and developers about not just whether the product works in the classroom, but how to improve it.

The next level in ESSA, called “correlational,” is considered weak evidence, because it shows only that the product has “promise” and is worth studying with a stronger method. However, this level is far more useful as a way for developers to gather information about which parts of the program are driving student results, and which patterns of usage may be detrimental. Schools can see if there is an amount of usage that maximizes the value of the product (rather than depending solely on the developer’s rationale). This level 2 calls for piloting the program and examining quantitative results. To get correlational results, the pilot must have enough students and may require going beyond a single school. This is a reason that we usually look for a district’s involvement in a pilot.

The top two levels in the ESSA scheme involve comparisons of students and teachers who use the product to those who do not. These are the levels where it begins to make sense to combine a number of studies of the same product from different districts in a statistical process called meta-analysis so we can start to make generalizations. At these levels, it is very important to look beyond just the comparison of the program group and the control group and gather information on the characteristics of schools, teachers, and students who benefit most (and least) from the product. This is the evidence of most value to product review sites.

When it comes to characterizing schools, teachers, and students, the “classroom” and the “district” approach have different, but equally important, needs.

  • The learner variability project has very fine-grained categories that teachers are able to establish for the students in their class.
  • For generalizable evidence, we need characteristics that are routinely collected by the schools. To make data analysis for efficacy studies a common occurrence, we have to avoid expensive surveys and testing of students that are used only for the research. Furthermore, the research community must reach consensus on a limited number of variables that will be used in research. Fortunately, another aspect of ESSA is the broadening of routine data collection for accountability purposes, so that information on improvements in socio-emotional learning or school climate will be usable in studies.

Edsurge and Digital Promise are part of a west coast contingent of researchers, funders, policymakers, and edtech developers that has been discussing these issues. We look forward to continuing this conversation within the framework provided by ESSA. When we look at the ESSA levels as not just vertical but building out from concrete classroom experience to more abstract and general results from thousands of school districts, then learning science and efficacy research are combined. This strengthens our ability to serve all students, teachers, and school leaders.

2018-10-08

The Rebel Alliance is Growing

The rebellion against the old NCLB way of doing efficacy research is gaining force. A growing community among edtech developers, funders, researchers, and school users has been meeting in an attempt to reach a consensus on an alternative built on ESSA.

This is being assisted by openness in the directions currently being pursued by IES. In fact, we are moving into a new phase marked by two-way communication with the regime. While the rebellion hasn’t yet handed over its lightsabers, it is encouraged by the level of interest from prominent researchers.

From these ongoing discussions, there have been some radical suggestions inching toward consensus. A basic idea now being questioned is this:

The difference between the average of the treatment group and the average of the control group is a valid measure of effectiveness.

There are two problems with this:

  1. In schools, there’s no “placebo” or something that looks like a useful program but is known to have zero effectiveness. Whatever is going on in the schools, or classes, or with teachers and students in the control condition has some usefulness or effectiveness. The usefulness of the activities in the control classes or schools may be greater than the activities being evaluated in the study, or may be not as useful. The study may find that the “effectiveness” of the activities being studied is positive, negative, or too small to be discerned statistically by the study. In any case, the size (negative or positive) of the effect is determined as much by what’s being done in the control group as the treatment group.
  2. Few educational activities have the same level of usefulness for all teachers and students. Looking at only the average will obscure the differences. For example, we ran a very large study for the U.S. Department of Education of a STEM program where we found, on average, the program was effective. What the department didn’t report was that it only worked for the white kids, not the black kids. The program increased instead of reducing the existing achievement gap. If you are considering adopting this STEM program, the impact on the different subgroups is relevant–a high minority school district may want to avoid it. Also, to make the program better, the developers need to know where it works and where it doesn’t. Again, the average impact is not just meaningless but also can be misleading.

A solution to the overuse of the average difference from studies is to conduct a lot more studies. The price the ED paid for our large study could have paid for 30 studies of the kind we are now conducting in the same state of the same program; in 10% of the time of the original study. If we had 10 different studies for each program, where studies are conducted in different school districts with different populations and levels of resources, the “average” across these studies start to make sense. Importantly, the average across these 10 studies for each of the subgroups will give a valid picture of where, how, and with which students and teachers the program tends to work best. This kind of averaging used in research is called meta-analysis and allows many small differences found across studies to build on the power of each study to generate reliable findings.

If developers or publishers of the products being used in schools took advantage of their hundreds of implementations to gather data, and if schools would be prepared to share student data for this research, we could have researcher findings that both help schools decide what will likely work for them and help developers improve their products.

2018-09-21

For Quasi-experiments on the Efficacy of Edtech Products, it is a Good Idea to Use Usage Data to Identify Who the Users Are

With edtech products, the usage data allows for precise measures of exposure and whether critical elements of the product were implemented. Providers often specify an amount of exposure or the kind of usage that is required to make a difference. Furthermore, educators often want to know whether the program has an effect when implemented as intended. Researchers can readily use data generated by the product (usage metrics) to identify compliant users, or to measure the kind and amount of implementation.

Since researchers generally track product implementation and statistical methods allow for adjustments for implementation differences, it is possible to estimate the impact on successful implementers, or technically, on a subset of study participants who were compliant with treatment. It is, however, very important that the criteria researchers use in setting a threshold be grounded in a model of how the program works. This will, for example, point to critical components that can be referred to in specifying compliance. Without a clear rationale for the threshold set in advance, the researcher may appear to be “fishing” for the amount of usage that produces an effect.

Some researchers reject comparison studies in which identification of the treatment group occurs after the product implementation has begun. This is based in part on the concern that the subset of users who comply with the suggested amount of usage will get more exposure to the program. More exposure will result in a larger effect. This assumes of course, that the product is effective, otherwise the students and teachers will have been wasting their time and will likely perform worse than the comparison group.

There is also the concern that the “compliers” may differ from the non-compliers (and non-users) in some characteristic that isn’t measured. And that even after controlling for measurable variables (prior achievement, ethnicity, English proficiency, etc.), there could be a personal characteristic that results in an otherwise ineffective program becoming effective for them. We reject this concern and take the position that a product’s effectiveness can be strengthened or weakened by many factors. A researcher conducting any matched comparison study can never be certain that there isn’t an unmeasured variable that is biasing it. (That’s why the What Works Clearinghouse only accepts Quasi-Experiments “with reservations.”) However, we believe that as long as the QE controls for the major factors that are known to affect outcomes, the study can meet the Every Student Succeeds Act requirement that the researcher “controls for selection bias.”

With those caveats, we believe that a QE, which identifies users by their compliance to a pre-specified level of usage, is a good design. Studies that look at the measurable variables that modify the effectiveness of a product can not only be useful for school in answering their question, “is the product likely to work in my school?” but points the developer and product marketer to ways the product can be improved.

2018-07-27

How Are Edtech Companies Thinking About Data and Research?

Forces of the rebellion were actively at work at SIIA’s Annual Conference last week in San Francisco. Snippets of conversation revealed a common theme of harnessing and leveraging data in order to better understand and serve the needs of schools and districts.

This theme was explored in depth during one panel session, “Efficacy and Research: Why It Matters So Much in the Education Market”, where edtech executives discussed the phases and roles of research as it relates to product improvement and marketing. Moderated by Pearson’s Gary Mainor, session panelists included Andrew Coulson of the MIND Research Institute, Kelli Hill of Kahn Academy, and Shawn Mahoney of McGraw Hill Education.

Coulson, who was one of the contributing reviewers of our Research Guidelines, stated that all signs are pointing to an “exponential increase” of school district customers asking for usage data. He advised fellow edtech entrepreneurs to start paying attention to fine-grained usage data, as it is becoming necessary to provide this for customers. Panelist Kelli Hill agreed with the importance of making data visible, adding that Kahn Academy proactively provides users with monthly usage reports.

In addition to providing helpful advice for edtech sales and marketing teams, the session also addressed a pervasive misconception that that all it takes is “one good study” to validate and prove the effectiveness of a program. A company could commission one rigorous randomized trial reporting positive results and obtaining endorsement from the What Works Clearinghouse, but that study might be outdated, and more importantly, not relevant to what schools and districts are looking for. Panelist Shawn Mahoney, Chief Academic Officer of McGraw-Hill Education, affirmed that school districts are interested in “super contextualized research” and look for recent and multiple studies when evaluating a product. Q&A discussions with the panelists revealed that school decision makers are quick to claim “what works for someone else might not work for us”, supporting the notion that the conduct of multiple research studies, reporting effects for various subgroups and populations of students, is much more useful and reflective of district needs.

SIIA’s gathering proved to be a fruitful event, allowing us to reconnect with old colleagues and meet new ones, and leaving us with a number of useful insights and optimistic possibilities for new directions in research.

2018-06-22

A Rebellion Against the Current Research Regime

Finally! There is a movement to make education research more relevant to educators and edtech providers alike.

At various conferences, we’ve been hearing about a rebellion against the “business as usual” of research, which fails to answer the question of, “Will this product work in this particular school or community?” For educators, the motive is to find edtech products that best serve their students’ unique needs. For edtech vendors, it’s an issue of whether research can be cost-effective, while still identifying a product’s impact, as well as helping to maximize product/market fit.

The “business as usual” approach against which folks are rebelling is that of the U.S. Education Department (ED). We’ll call it the regime. As established by the Education Sciences Reform Act of 2002 and the Institute of Education Sciences (IES), the regime anointed the randomized control trial (or RCT) as the gold standard for demonstrating that a product, program, or policy caused an outcome.

Let us illustrate two ways in which the regime fails edtech stakeholders.

First, the regime is concerned with the purity of the research design, but not whether a product is a good fit for a school given its population, resources, etc. For example, in an 80-school RCT that the Empirical team conducted under an IES contract on a statewide STEM program, we were required to report the average effect, which showed a small but significant improvement in math scores (Newman et al., 2012). The table on page 104 of the report shows that while the program improved math scores on average across all students, it didn’t improve math scores for minority students. The graph that we provide here illustrates the numbers from the table and was presented later at a research conference.

bar graph representing math, science, and reading scores for minority vs non-minority students

IES had reasons couched in experimental design for downplaying anything but the primary, average finding, however this ignores the needs of educators with large minority student populations, as well as of edtech vendors that wish to better serve minority communities.

Our RCT was also expensive and took many years, which illustrates the second failing of the regime: conventional research is too slow for the fast-moving innovative edtech development cycles, as well as too expensive to conduct enough research to address the thousands of products out there.

These issues of irrelevance and impracticality were highlighted last year in an “academic symposium” of 275 researchers, edtech innovators, funders, and others convened by the organization now called Jefferson Education Exchange (JEX). A popular rallying cry coming out of the symposium is to eschew the regime’s brand of research and begin collecting product reviews from front-line educators. This would become a Consumer Reports for edtech. Factors associated with differences in implementation are cited as a major target for data collection. Bart Epstein, JEX’s CEO, points out: “Variability among and between school cultures, priorities, preferences, professional development, and technical factors tend to affect the outcomes associated with education technology. A district leader once put it to me this way: ‘a bad intervention implemented well can produce far better outcomes than a good intervention implemented poorly’.”

Here’s why the Consumer Reports idea won’t work. Good implementation of a program can translate into gains on outcomes of interest, such as improved achievement, reduction in discipline referrals, and retention of staff, but only if the program is effective. Evidence that the product caused a gain on the outcome of interest is needed or else all you measure is the ease of implementation and student engagement. You wouldn’t know if the teachers and students were wasting their time with a product that doesn’t work.

We at Empirical Education are joining the rebellion. The guidelines for research on edtech products we recently prepared for the industry and made available here is a step toward showing an alternative to the regime while adopting important advances in the Every Student Succeeds Act (ESSA).

We share the basic concern that established ways of conducting research do not answer the basic question that educators and edtech providers have: “Is this product likely to work in this school?” But we have a different way of understanding the problem. From years of working on federal contracts (often as a small business subcontractor), we understand that ED cannot afford to oversee a large number of small contracts. When there is a policy or program to evaluate, they find it necessary to put out multi-million-dollar, multi-year contracts. These large contracts suit university researchers, who are not in a rush, and large research companies that have adjusted their overhead rates and staffing to perform on these contracts. As a consequence, the regime becomes focused on the perfection in the design, conduct, and reporting of the single study that is intended to give the product, program, or policy a thumbs-up or thumbs-down.

photo of students in a classroom on computers

There’s still a need for a causal research design that can link conditions such as resources, demographics, or teacher effectiveness with educational outcomes of interest. In research terminology, these conditions are called “moderators,” and in most causal study designs, their impact can be measured.

The rebellion should be driving an increase the number of studies by lowering their cost and turn-around time. Given our recent experience with studies of edtech products, this reduction can reach a factor of 100. Instead of one study that costs $3 million and takes 5 years, think in terms of a hundred studies that cost $30,000 each and are completed in less than a month. If for each product, there are 5 to 10 studies that are combined, they would provide enough variation and numbers of students and schools to detect differences in kinds of schools, kinds of students, and patterns of implementation so as to find where it works best. As each new study is added, our understanding of how it works and with whom improves.

It won’t be enough to have reviews of product implementation. We need an independent measure of whether—when implemented well—the intervention is capable of a positive outcome. We need to know that it can make (i.e., cause) a difference AND under what conditions. We don’t want to throw out research designs that can detect and measure effect sizes, but we should stop paying for studies that are slow and expensive.

Our guidelines for edtech research detail multiple ways that edtech providers can adapt research to better work for them, especially in the era of ESSA. Many of the key recommendations are consistent with the goals of the rebellion:

  • The usage data collected by edtech products from students and teachers gives researchers very precise information on how well the program was implemented in each school and class. It identifies the schools and classes where implementation met the threshold for which the product was designed. This is a key to lowering cost and turn-around time.
  • ESSA offers four levels of evidence which form a developmental sequence, where the base level is based on existing learning science and provides a rationale for why a school should try it. The next level looks for a correlation between an important element in the rationale (measured through usage of that part of the product) and a relevant outcome. This is accepted by ESSA as evidence of promise, informs the developers how the product works, and helps product marketing teams get the right fit to the market. a pyramid representing the 4 levels of ESSA
  • The ESSA level that provides moderate evidence that the product caused the observed impact requires a comparison group matched to the students or schools that were identified as the users. The regime requires researchers to report only the difference between the user and comparison groups on average. Our guidelines insist that researchers must also estimate the extent to which an intervention is differentially effective for different demographic categories or implementation conditions.

From the point of view of the regime, nothing in these guidelines actually breaks the rules and regulations of ESSA’s evidence standards. Educators, developers, and researchers should feel empowered to collect data on implementation, calculate subgroup impacts, and use their own data to generate evidence sufficient for their own decisions.

A version of this article was published in the Edmarket Essentials magazine.

2018-05-09

Updated Research Guidelines Will Improve Education Technology Products and Provide More Value to Schools

Recommendations include 16 best practices for the design, implementation, and reporting of Usable Evidence for Educators

Palo Alto, CA (April 25, 2018) – Empirical Education Inc. and the Education Technology Industry Network (ETIN) of SIIA released an important update to the “Guidelines for Conducting and Reporting Edtech Impact Research in U.S. K-12 Schools” today.

Authored by Empirical Education researchers, Drs. Denis Newman, Andrew Jaciw, and Valeriy Lazarev, the Guidelines detail 16 best practices for the design, implementation, and reporting of efficacy research of education technology. Recommendations range from completing the product’s logic model before fielding it to disseminating a study’s results in accessible and non-technical language.

The Guidelines were first introduced in July 2017 at ETIN’s Edtech Impact Symposium to address the changing demand for research. They served to address new challenges driven by the accelerated pace of edtech development and product releases, the movement of new software to the cloud, and the passage of the Every Student Succeeds Act (ESSA). The authors committed to making regular updates to keep pace with technical advances in edtech and research methods.

“Our collaboration with ETIN brought the right mix of practical expertise to this important document,” said Denis Newman, CEO of Empirical Education and lead author of the Guidelines. “ETIN provided valuable expertise in edtech marketing, policy, and development. With over a decade of experience evaluating policies, programs, and products for the U.S. Department of Education, major research organizations, and publishers, Empirical Education brought a deep understanding of how studies are traditionally performed and how they can be improved in the future. Our experience with our Evidence as a Service™ offering to investors and developers of edtech products also informed the guidelines.”

The current edition advocates for analysis of usage patterns in the data collected routinely by edtech applications. These patterns help to identify classrooms and schools with adequate implementation and lead to lower-cost faster turn-around research. So rather than investing hundreds of thousands of dollars in a single large-scale study, developers should consider multiple small-scale studies. The authors point to the advantages of looking at subgroup analysis to better understand how and for whom the product works best, thus more directly answering common educator questions. Issues with quality of implementation are addressed in greater depth, and the visual design of the Guidelines has been refined for improved readability.

“These guidelines may spark a rebellion against the research business as usual, which doesn’t help educators know whether an edtech product will work for their specific populations. They also provide a basis for schools and developers to partner to make products better,” said Mitch Weisburgh, Managing Partner of Academic Business Advisors, LLC and President of ETIN, who has moderated panels and webinars on edtech research.

Empirical Education, in partnership with a variety of organizations, is conducting webinars to help explain the updates to the Guidelines, as well as to discuss the importance of these best practices in the age of ESSA. The updated Guidelines are available here: https://www.empiricaleducation.com/research-guidelines/.

2018-04-25

Where's Denis?

It’s been a busy month for Empirical CEO Denis Newman, who’s been in absentia at our Palo Alto office as he jet-sets around the country to spread the good word of rigorous evidence in education research.

His first stop was Washington, DC and the conference of the Society for Research on Educational Effectiveness (SREE). This was an opportunity to get together with collaborators, as well as plot proposal writing, blog postings, webinars, and revisions to our research guidelines for edtech impact studies. Andrew Jaciw, Empirical’s Chief Scientist, kept up the company’s methodological reputation with a paper presentation on “Leveraging Fidelity Data to Make Sense of Impact Results.” For Denis, a highlight was dinner with Peg Griffin, a longtime friend and his co-author on The Construction Zone. Then it was on to Austin, TX, for a very different kind of meeting—more of a festival, really.

At this year’s SXSWEDU, Denis was one of three speakers on the panel, “Can Evidence Even Keep Up with Edtech?” The problem presented by the panel was that edtech, as a rapidly moving field, seems to be outpacing the rate of research that stakeholders may want to use to evaluate these products. How, then, could education stakeholders make informed decisions about whether to use edtech products?

According to Denis, the most important thing is for a district to have enough information to know whether a given edtech product may or may not work for that district’s unique population and context. Therefore, researchers may need to adapt their methods both to be able to differentiate a product’s impact between subgroups, as well as to meet the faster timelines of edtech product development. Empirical’s own solution to this quandry, Evidence as a ServiceTM, offers quick-turnaround research reports that can examine impact and outcomes for specific student subgroups, with methodology that is flexible but rigorous enough to meet ESSA standards.

Denis praised the panel, stating, “In the festival’s spirit of invention, our moderator, Mitch Weisberg, masterfully engaged the audience from the beginning to pose the questions for the panel. Great questions, too. I got to cover all of my prepared talking points!”

You can read more coverage of our SXSWEDU panel on EdSurge.

After the panel, a string of meetings and parties kept the energy high and continued to show the growing interest in efficacy. The ISTE meetup was particularly important following this theme. The concern raised by the ISTE leadership and its members—which are school-based technology users—was that traditional research doesn’t tell the practitioners whether a product is likely to work in their school, given its resources and student demographics. Users are faced with hundreds of choices in any product category and have little information for narrowing down the choice to a few that are worth piloting.

Following SXSWEDU, it was back to DC for the Consortium for School Networking (CoSN) conference. Denis participated in the annual Feedback Forum hosted by CoSN and the Software & Information Industry Association (SIIA), where SIIA—representing edtech developers—looked for feedback from the CIOs and other school district leaders. This year, SIIA was looking for feedback that would help the Empirical team improve the edtech research guidelines, which are sponsored by SIIA’s Education Technology Industry Network (ETIN). Linda Winter moderated and ran the session like a focus group, asking questions such as:

  • What data do you need from products to gauge engagement?
  • How can the relationship of engagement and achievement indicate that a product is working?
  • What is the role of pilots in measuring success?
  • And before a pilot decision is made, what do CoSN members need to know about edtech products to decide if they are likely to work?

The CoSN members were brutally honest, pointing out that as the leaders responsible for the infrastructure, they were concerned with implementability, bandwidth requirements, and standards such as single sign-on. Whether the software improved learning was secondary—if teachers couldn’t get the program to work, it hardly mattered how effective it may be in other districts.

Now, Denis is preparing for the rest of the spring conference season. Next stop will be New York City and the American Education Research Association (AERA) conference, which attracts over 20,000 researchers annually. The Empirical team will be presenting four studies, as well as co-hosting a cocktail reception with AERA’s school research division. Then, it’s back on the plane for ASU-GSV in San Diego.

For more information about Evidence as a Service, the edtech research guidelines, or to invite Denis to speak at your event, please email rmeans@empiricaleducation.com

2018-03-26

Presenting at AERA 2018

We will again be presenting at the annual meeting of the American Educational Research Association (AERA). Join the Empirical Education team in New York City from April 13-17, 2018.

Research presentations will include the following.

For Quasi-Experiments on EdTech Products, What Counts as Being Treated?
Authors: Val Lazarev, Denis Newman, & Malvika Bhagwat
In Roundtable Session: Examining the Impact of Accountability Systems on Both Teachers and Students
Friday, April 13 - 2:15 to 3:45pm
New York Marriott Marquis, Fifth Floor, Westside Ballroom Salon 3

Abstract: Edtech products are becoming increasingly prevalent in K-12 schools and the needs of schools to evaluate their value for students calls for a program of rigorous research, at least at the level 2 of the ESSA standards for evidence. This paper draws on our experience conducting a large scale quasi-experiment in California schools. The nature of the product’s wide-ranging intensity of implementation presented a challenge in identifying schools that had used the product adequately enough to be considered part of the treatment group.


Planning Impact Evaluations Over the Long Term: The Art of Anticipating and Adapting
Authors: Andrew P Jaciw & Thanh Thi Nguyen
In Session: The Challenges and Successes of Conducting Large-Scale Educational Research
Saturday, April 14 - 2:15 to 3:45pm
Sheraton New York Times Square, Second Floor, Central Park East Room

Abstract: Perspective. It is good practice to identify core research questions and important elements of study designs a-priori, to prevent post-hoc “fishing” exercises and reduce the role of drawing false-positive conclusions [16,19]. However, programs in education, and evaluations of them, evolve [6] making it difficult to follow a charted course. For example, in the lifetime of a program and its evaluation, new curricular content or evidence standards for evaluations may be introduced and thus drive changes in program implementation and evaluation.

Objectives. This work presents three cases from program impact evaluations conducted through the Department of Education. In each case, unanticipated results or changes in study context had significant consequences for program recipients, developers and evaluators. We discuss responses, either enacted or envisioned, for addressing these challenges. The work is intended to serve as a practical guide for researchers and evaluators who encounter similar issues.

Methods/Data Sources/Results. The first case concerns the problem of outcome measures keeping pace with evolving content standards. For example, in assessing impacts of science programs, program developers and evaluators are challenged to find assessments that align with Next Generation Science Standards (NGSS). Existing NGSS-aligned assessments are largely untested or in development, resulting in the evaluator having to find, adapt or develop instruments with strong reliability, and construct and face validity – ones that will be accepted by independent review and not considered over-aligned to the interventions. We describe a hands-on approach to working with a state testing agency to develop forms to assess impacts on science generally, and on constructs more-specifically aligned to the program evaluated. The second case concerns the problem of reprioritizing research questions mid-study. As noted above, researchers often identify primary (confirmatory) research questions at the outset of a study. Such questions are held to high evidence standards, and are differentiated from exploratory questions, which often originate after examining the data, and must be replicated to be considered reliable [16]. However, sometimes, exploratory analyses produce unanticipated results that may be highly consequential. The evaluator must grapple with the dilemma of whether to re-prioritize the result, or attempt to proceed with replication. We discuss this issue with reference to an RCT in which the dilemma arose. The third addresses the problem of designing and implementing a study that meets one set of evidence standards, when the results will be reviewed according to a later version of those standards. A practical question is what to do when this happens and consequently the study falls under a lower tier of the new evidence standard. With reference to an actual case, we consider several response options, including assessing the consequence of this reclassification for future funding of the program, and augmenting the research design to satisfy the new standards of evidence.

Significance. Responding to demands of changing contexts, programs in the social sciences are moving targets. They demand a flexible but well-reasoned and justified approach to evaluation. This session provides practical examples and is intended to promote discussion for generating solutions to challenges of this kind.


Indicators of Successful Teacher Recruitment and Retention in Oklahoma Rural Schools
Authors: Val Lazarev, Megan Toby, Jenna Lynn Zacamy, Denis Newman, & Li Lin
In Session: Teacher Effectiveness, Retention, and Coaching
Saturday, April 14 - 4:05 to 6:05pm
New York Marriott Marquis, Fifth Floor, Booth

Abstract: The purpose of this study was to identify factors associated with successful recruitment and retention of teachers in Oklahoma rural school districts, in order to highlight potential strategies to address Oklahoma’s teaching shortage. The study was designed to identify teacher-level, district-level, and community characteristics that predict which teachers are most likely to be successfully recruited and retained. A key finding is that for teachers in rural schools, total compensation and increased responsibilities in job assignment are positively associated with successful recruitment and retention. Evidence provided by this study can be used to inform incentive schemes to help retain certain groups of teachers and increase retention rates overall.


Teacher Evaluation Rubric Properties and Associations with School Characteristics: Evidence from the Texas Evaluation System
Authors: Val Lazarev, Thanh Thi Nguyen, Denis Newman, Jenna Lynn Zacamy, Li Lin
In Session: Teacher Evaluation Under the Microscope
Tuesday, April 17 - 12:25 to 1:55pm
New York Marriott Marquis, Seventh Floor, Astor Ballroom

Abstract: A 2009 seminal report, The Widget Effect, alerted the nation to the tendency of traditional teacher evaluation systems to treat teachers like widgets, undifferentiated in their level of effectiveness. Since then, a growing body of research, coupled with new federal initiatives, has catalyzed the reform of such systems. In 2014-15, Texas piloted its reformed evaluation system, collecting classroom observation rubric ratings from over 8000 teachers across 51 school districts. This study analyzed that large dataset and found that 26.5 percent, compared to 2 percent under previous measures, of teachers were rated below proficient. The study also found a promising indication of low bias in the rubric ratings stemming from school characteristics, given that they were minimally associated with observation ratings.

We look forward to seeing you at our sessions to discuss our research. We’re also co-hosting a cocktail reception with Division H! If you’d like an invite, let us know.

2018-03-06

Jefferson Education Accelerator Contracts with Empirical for Evidence as a Service™

Jefferson Education Accelerator (JEA) has contracted with Empirical Education Inc. for research services that will provide evidence of the impact of education technology products developed by their portfolio companies. JEA’s mission is to support and evaluate promising edtech solutions in order to help educators make more informed decisions about the products they invest in. The study is designed to meet level 2 or “moderate” evidence as defined by the Every Student Succeeds Act. Empirical will provide a Student Impact Report under its Evidence as a Service offering, which combines student-level product usage data and a school district’s administrative data to conduct a comparison group study. Denis Newman, Empirical’s CEO stated, “This is a perfect application of our Evidence as a Service product, which provides fast answers to questions about which kids will benefit the most from any particular learning program.” Todd Bloom, JEA’s Chief Academic Officer and Research Associate Professor at UVA’s Curry School of Education, commented: “Empirical Education is a highly respected research firm and offers the type of aggressive timeline that is sorely needed in the fast-paced edtech market.” A report on impact in the school year 2017-2018 is expected to be completed in July.

2018-02-20

Spring 2018 Conference Season is Taking Shape


We’ll be on the road again this spring.

SREE

Andrew Jaciw and Denis Newman will be in Washington DC for the annual spring conference of the The Society for Research on Educational Effectiveness (SREE), the premier conference on rigorous research. Andrew Jaciw will present his paper: Leveraging Fidelity Data to Making Sense of Impact Results: Informing Practice through Research. His presentation will be a part of Session 2I: Research Methods - Post-Random Assignment Models: Fidelity, Attrition, Mediation & More from 8-10am on Thursday, March 1.

SXSW EDU

In March, Denis Newman will be attending SXSW EDU Conference & Festival in Austin, TX and presenting on a panel along with Malvika Bhagwat, Jason Palmer, and Karen Billings titled Can Evidence Even Keep Up with EdTech? This will address how researchers and companies can produce evidence that products work—in time for educators and administrators to make a knowledgeable buying decision under accelerating timelines.

AERA

Empirical staff will be presenting in 4 different sessions at the annual conference of the American Educational Research Association (AERA) in NYC in April, all under Division H (Research, Evaluation, and Assessment in Schools).

  1. For Quasi-experiments on Edtech Products, What Counts as Being Treated?
  2. Teacher evaluation rubric properties and associations with school characteristics: Evidence from the Texas evaluation system
  3. Indicators of Successful Teacher Recruitment and Retention in Oklahoma Rural Schools
  4. The Challenges and Successes of Conducting Large-scale Educational Research

In addition to these presentations, we are planning another of our celebrated receptions in NYC so stay tuned for details.

ISTE

A panel on our Research Guidelines has been accepted at this major convention, considered the epicenter of edtech with thousands of users and 100s of companies, held this year in Chicago from June 24–27.

2017-12-18
Archive