blog posts and news stories

Conference Season 2019

Are you staying warm this winter? Can’t wait for the spring? Neither can we, with spring conference season right around the corner! Find our Empirical team traveling bicoastally in these upcoming months.

We’re starting the season right in our backyard at the Bay Area Learning Analytics (BayLAN) Conference at Stanford University on March 2, 2019! CEO Denis Newman will be presenting on a panel on the importance of efficacy with Jeremy Roschelle of Digital Promise. Senior Research Scientist Valeriy Lazarev will also be attending the conference.

The next day, the team will be off to SXSW EDU in Austin, Texas! Our goal is to talk to people about the new venture, Evidentally.

Then we’re headed to Washington, D.C. to attend the annual Society for Research on Educational Effectiveness (SREE) Conference! Andrew Jaciw will be presenting “A Study of the Impact of the CREATE Residency Program on Teacher Socio-Emotional and Self-Regulatory Outcomes”. We will be presenting on Friday, March 8, 2:30 PM - 4:00 PM during the “Social and Emotional Learning in Education Settings” sessions in Ballroom 1. Denis will also be attending and, together with Andrew, meeting with many research colleagues. If you can’t catch us in D.C., you can find Andrew back in the Bay Area at the sixth annual Carnegie Foundation Summit.

For the last leg of spring conferences, we’ll be back at the American Educational Research Association’s (AERA) Annual Meeting in Toronto, Canada from April 6th to 9th. There you’ll be able to hear more about the CREATE Teacher Residency Research Study presented by Andrew Jaciw, joined by Vice President of Research Operations Jenna Zacamy along with our new Research Manager, Audra Wingard. And for the first time in 10 years, you won’t be finding Denis at AERA… Instead he’ll be at the ASU GSV Summit in San Diego, California!

2019-02-12

Evidentally, a New Company Taking on Edtech Efficacy Analytics

Empirical Education has launched Evidentally, Inc., a new company that specializes in helping edtech companies and their investors make more effective products. Founded by Denis Newman, CEO, and Val Lazarev, Chief Product Architect, the company conducts rapid cycle evaluations that meet the federal Every Student Succeeds Act standards for moderate and promising evidence. The efficacy analytics leverage the edtech product’s usage metrics to efficiently identify states and districts with sufficient usage to make impact studies feasible. Evidentally is actively servicing and securing initial clients, and is seeking seed funding to prepare for expansion. In the meantime, the company is being incubated by Empirical Education, which has transferred intellectual property relating to its R&D prototypes of the service and is providing staffing through a services agreement. The Evidentally team will be meeting with partners and investors at SXSW EDU, EdSurge Immersion, ASU GSV Summit, and ISTE. Let’s talk!

2019-02-06

Research on AMSTI Presented to the Alabama State Board of Education

On January 13, Dr. Eric Mackey, Alabama’s new State Superintendent of Education, presented our rapid cycle evaluation of the Alabama Math, Science, and Technology Initiative (AMSTI). The study is based on results for the 2016-17 school year, for which outcome data were available at the time the Alabama State Department of Education (ALSDE) contracted with Empirical in July 2018.

AMSTI is ALSDE’s initiative to improve math and science teaching statewide; the program, which started over 20 years ago, now operates in over 900 schools across the state.

Our current project, led by Val Lazarev, compares classes taught by teachers who were fully trained in AMSTI with matched classrooms taught by teachers with no AMSTI training. The overall results, shown in the above graph, were similar in magnitude to those of Empirical’s 2012 study directed by Denis Newman and designed by Empirical’s Chief Scientist, Andrew Jaciw. That cluster-randomized trial, which involved 82 schools and ~700 teachers, showed AMSTI had a small overall positive effect. The earlier study also showed that AMSTI may be exacerbating the achievement gap between black and white students. Since ALSDE was also interested in information that could improve AMSTI, the current study examined a number of subgroup impacts. In this project we did not find a difference between the value of AMSTI for black and white students. We did find a strong benefit for females in science. And for English learners, there was a negative effect of being in a science class of an AMSTI-trained teacher. The state board expressed concern and a commitment to using the results to guide improvement of the program.

Download both of the reports here.

2019-01-22

View from the West Coast: Relevance is More Important than Methodological Purity

Bob Slavin published a blog post in which he argues that evaluation research can be damaged by using the cloud-based data routinely collected by today’s education technology (edtech). We see serious flaws with this argument and it is quite clear that he directly opposes the position we have taken in a number of papers and postings, and also discussed as part of the west coast conversations about education research policy. Namely, we’ve argued that using the usage data routinely collected by edtech can greatly improve the relevance and usefulness of evaluations.

Bob’s argument is that if you use data collected during the implementation of the program to identify students and teachers who used the product as intended, you introduce bias. The case he is concerned with is a matched comparison study (or quasi-experiment), where the researcher has to find students or classes that match the students using the edtech. The key point he makes is:

“students who used the computers [or edtech product being evaluated] were more motivated or skilled than other students in ways the pretests do not detect.”

That is, there is an unmeasured characteristic, let’s call it motivation, that both explains the student’s desire to use the product and explains why they did better on the outcome measure. Since the characteristic is not measured, you don’t know which students in the control classes have this motivation. If you select the matching students only on the basis of their having the same pretest level, demographics, and other measured characteristics but you don’t match on “motivation”, you have biased the result.
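To make the mechanism concrete, here is a toy sketch with made-up numbers (not data from any study). The `motivation` column stands in for the unmeasured characteristic, which a real study would not have; the outcome model and its `motivation_weight` parameter are assumptions for illustration only. Users are matched to non-users on pretest alone.

```python
from statistics import mean

# Hypothetical students: (pretest, motivation, used_product).
# In this toy data every user is "motivated" and every non-user is not.
STUDENTS = [
    (50, 1, True), (60, 1, True), (70, 1, True),
    (50, 0, False), (60, 0, False), (70, 0, False),
]


def outcome(pretest, motivation, used, true_effect, motivation_weight):
    """Made-up outcome model: pretest plus motivation and product contributions."""
    return pretest + motivation_weight * motivation + (true_effect if used else 0)


def matched_estimate(true_effect, motivation_weight):
    """Match each user to a non-user with the same pretest; average the outcome gaps."""
    users = [s for s in STUDENTS if s[2]]
    non_users = [s for s in STUDENTS if not s[2]]
    gaps = []
    for pre, mot, used in users:
        match = next(n for n in non_users if n[0] == pre)
        gaps.append(
            outcome(pre, mot, used, true_effect, motivation_weight)
            - outcome(*match, true_effect, motivation_weight)
        )
    return mean(gaps)


# If motivation has no influence on the outcome, matching recovers the true effect.
print(matched_estimate(true_effect=2, motivation_weight=0))
# If motivation does influence the outcome, the estimate absorbs that influence.
print(matched_estimate(true_effect=2, motivation_weight=5))
```

Whether the estimate is off depends entirely on `motivation_weight`, which is exactly the quantity nobody can observe.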

The first thing to note about this concern is that there may not be a factor such as motivation that explains both edtech usage and the favorable outcome. It is just that there is a theoretical possibility that such a variable is driving the result. The bias may or may not be there, and to reject a method because there is an unverifiable possibility of bias is an extreme move.

Second, it is interesting that he uses an example that seems concrete but is not at all specific to the bias mechanism he’s worried about.

“Sometimes teachers use computer access as a reward for good work, or as an extension activity, in which case the bias is obvious.”

This isn’t a problem of an unmeasured variable at all. The problem is that the usage didn’t cause the improvement; rather, the improvement caused the usage. This would be a problem in a randomized “gold standard” experiment. The example makes it sound like the problem is “obvious” and concrete, when Bob’s concern is purely theoretical. This example is a good argument for having implementation analyses of the sort that ISTE is doing in their Edtech Advisor and Jefferson Education Exchange has embarked on.

What is most disturbing about Bob’s blog post is that he makes a statement that is not supported by the ESSA definitions or U.S. Department of Education regulations or guidance. He claims that:

“In order to reach the second level (“moderate”) of ESSA or Evidence for ESSA, a matched study must do everything a randomized study does, including emphasizing ITT [Intent To Treat, i.e., using all students in the pre-identified schools or classes where administrators intended to use the product] estimates, with the exception of randomizing at the start.”

It is true that Bob’s own site, Evidence for ESSA, will not accept any study that does not follow the ITT protocol, but ESSA itself does not require that constraint.

Essentially, Bob is throwing away relevance to school decision-makers in order to maintain an unnecessary purity of research design. School decision-makers care whether the product is likely to work with their school’s population and available resources. Can it solve their problem (e.g., reduce achievement gaps among demographic categories) if they can implement it adequately? Disallowing efficacy studies that consider compliance to a pre-specified level of usage in selecting the “treatment group” throws out relevance in favor of methodological purity. Yes, there is a potential for bias, which is why ESSA considers matched-comparison efficacy studies to be “moderate” evidence. But school decisions aren’t made on the basis of which product has the largest average effect when all the non-users are included. A measure of subgroup differences, when the implementation is adequate, provides more useful information.

2018-12-27

Classrooms and Districts: Breaking Down Silos in Education Research and Evidence

I just got back from Edsurge’s Fusion conference. The theme, aimed at classroom and school leaders, was personalizing classroom instruction. This is guided by learning science, which includes brain development and the impact of trauma, as well as empathetic caregiving, as Pamela Cantor beautifully explained in her keynote. It also leads to detailed characterizations of learner variability being explored at Digital Promise by Vic Vuchic’s team, which is providing teachers with mappings between classroom goals and tools and strategies that can address learners who vary in background, cognitive skills, and socio-emotional character.

One of the conference tracks that particularly interested me was the workshops and discussions under “Research & Evidence”. Here is where I experienced a disconnect between Empirical’s research policy-oriented work interpreting ESSA and Fusion’s focus on improving the classroom.

  • The Fusion conference is focused at the classroom level, where teachers along with their coaches and school leaders are making decisions about personalizing the instruction to students. They advocate basing decisions on research and evidence from the learning sciences.
  • Our work, also using research and evidence, has been focused on the school district level where decisions are about procurement and implementation of educational materials including the technical infrastructure needed, for example, for edtech products.

While the classroom and district levels have different needs and resources and look to different areas of scientific expertise, they need not form conceptual silos. But the differences need to be understood.

Consider the different ways we look at piloting a new product.

  • The Digital Promise edtech pilot framework attempts to move schools toward a more planful approach by getting them to identify and quantify the problem for which the product being piloted could be a solution. The success in the pilot classrooms is evaluated by the teachers, where detailed understandings by the teacher don’t call for statistical comparisons. Their framework points to tools such as the RCE Coach that can help with the statistics to support local decisions.
  • Our work looks at pilots differently. Pilots are excellent for understanding implementability and classroom acceptance (and working with developers to improve the product), but even with rapid cycle tools, the quantitative outcomes are usually not available in time for local decisions. We are more interested in how data can be accumulated nationally from thousands of pilots so that teachers and administrators can get information on which products are likely to work in their classrooms given their local demographics and resources. This is where review sites like Edsurge product reviews or Noodle’s ProcureK12 could be enhanced with evidence about for whom, and under what conditions, the products work best. With over 5,000 edtech products, an initial filter to help choose what a school should pilot will be necessary.

A framework that puts these two approaches together is promulgated in the Every Student Succeeds Act (ESSA). ESSA defines four levels of evidence, based on the strength of the causal inference about whether the product works. More than just a system for rating the scientific rigor of a study, it is a guide to developing a research program with a basis in learning science. The base level says that the program must have a rationale. This brings us back to the Digital Promise edtech pilot framework needing teachers to define their problem. The ESSA level 1 rationale is what the pilot framework calls for. Schools must start thinking through what the problem is that needs to be solved and why a particular product is likely to be a solution. This base level sets up the communication between educators and developers about not just whether the product works in the classroom, but how to improve it.

The next level in ESSA, called “correlational,” is considered weak evidence, because it shows only that the product has “promise” and is worth studying with a stronger method. However, this level is far more useful as a way for developers to gather information about which parts of the program are driving student results, and which patterns of usage may be detrimental. Schools can see if there is an amount of usage that maximizes the value of the product (rather than depending solely on the developer’s rationale). This level 2 calls for piloting the program and examining quantitative results. To get correlational results, the pilot must have enough students and may require going beyond a single school. This is a reason that we usually look for a district’s involvement in a pilot.

The top two levels in the ESSA scheme involve comparisons of students and teachers who use the product to those who do not. These are the levels where it begins to make sense to combine a number of studies of the same product from different districts in a statistical process called meta-analysis so we can start to make generalizations. At these levels, it is very important to look beyond just the comparison of the program group and the control group and gather information on the characteristics of schools, teachers, and students who benefit most (and least) from the product. This is the evidence of most value to product review sites.

When it comes to characterizing schools, teachers, and students, the “classroom” and the “district” approach have different, but equally important, needs.

  • The learner variability project has very fine-grained categories that teachers are able to establish for the students in their class.
  • For generalizable evidence, we need characteristics that are routinely collected by the schools. To make data analysis for efficacy studies a common occurrence, we have to avoid expensive surveys and testing of students that are used only for the research. Furthermore, the research community must reach consensus on a limited number of variables that will be used in research. Fortunately, another aspect of ESSA is the broadening of routine data collection for accountability purposes, so that information on improvements in socio-emotional learning or school climate will be usable in studies.

Edsurge and Digital Promise are part of a west coast contingent of researchers, funders, policymakers, and edtech developers that has been discussing these issues. We look forward to continuing this conversation within the framework provided by ESSA. When we look at the ESSA levels as not just vertical but building out from concrete classroom experience to more abstract and general results from thousands of school districts, then learning science and efficacy research are combined. This strengthens our ability to serve all students, teachers, and school leaders.

2018-10-08

The Evaluation of CREATE Continues

Empirical Education began conducting the evaluation of Collaboration and Reflection to Enhance Atlanta Teacher Effectiveness (CREATE) in 2015 under a subcontract with Atlanta Neighborhood Charter Schools (ANCS) as part of their Investing in Innovation (i3) Development grant. Since our last CREATE update, we’ve extended this work through the Supporting Effective Educator Development (SEED) Grant Program. The SEED grant provides continued funding for three more cohorts of participants and expands the research to include experienced educators (those not in the CREATE residency program) in CREATE schools. The grant was awarded to Georgia State University and includes partnerships with ANCS, Empirical Education (as the external evaluator), and local schools and districts.

Similar to the i3 work, we’re following a treatment and comparison group over the course of the three-year CREATE residency program and looking at impacts on teacher effectiveness, teacher retention, and student achievement. With the SEED project, we will also be able to follow Cohorts 3 and 4 for an additional 1-2 years following residency. Surveys will measure perceived levels of social capital, school climate and community, collaboration, resilience, and mindfulness, in addition to other topics. Recruitment for Cohort 4 began this past spring and continued through the summer, resulting in approximately 70 new participants.

One of the goals of the expanded CREATE programming is to support the effectiveness and social capital of experienced educators in CREATE schools. Any experienced educator in a CREATE school who attends CREATE professional learning activities will be invited to participate in the research study. Surveys will measure similar topics to those measured in the quasi-experiment and we conduct individual interviews with a sample of participants to gain an in-depth understanding of the participant experience.

We have completed our first year of experienced educator research and continue to recruit participants, on an ongoing basis, into the second year of the study. We currently have 88 participants and counting.

2018-10-03

The Rebel Alliance is Growing

The rebellion against the old NCLB way of doing efficacy research is gaining force. A growing community among edtech developers, funders, researchers, and school users has been meeting in an attempt to reach a consensus on an alternative built on ESSA.

This is being assisted by openness in the directions currently being pursued by IES. In fact, we are moving into a new phase marked by two-way communication with the regime. While the rebellion hasn’t yet handed over its lightsabers, it is encouraged by the level of interest from prominent researchers.

From these ongoing discussions, there have been some radical suggestions inching toward consensus. A basic idea now being questioned is this:

The difference between the average of the treatment group and the average of the control group is a valid measure of effectiveness.

There are two problems with this:

  1. In schools, there’s no “placebo” or something that looks like a useful program but is known to have zero effectiveness. Whatever is going on in the schools, or classes, or with teachers and students in the control condition has some usefulness or effectiveness. The usefulness of the activities in the control classes or schools may be greater than the activities being evaluated in the study, or may be not as useful. The study may find that the “effectiveness” of the activities being studied is positive, negative, or too small to be discerned statistically by the study. In any case, the size (negative or positive) of the effect is determined as much by what’s being done in the control group as the treatment group.
  2. Few educational activities have the same level of usefulness for all teachers and students. Looking at only the average will obscure the differences. For example, we ran a very large study for the U.S. Department of Education of a STEM program where we found, on average, the program was effective. What the department didn’t report was that it only worked for the white kids, not the black kids. The program increased instead of reducing the existing achievement gap. If you are considering adopting this STEM program, the impact on the different subgroups is relevant: a high-minority school district may want to avoid it. Also, to make the program better, the developers need to know where it works and where it doesn’t. Again, the average impact is not just meaningless but also can be misleading.

A solution to the overuse of the average difference from studies is to conduct a lot more studies. The price ED paid for our large study could have paid for 30 studies of the kind we are now conducting in the same state of the same program, each in 10% of the time of the original study. If we had 10 different studies for each program, where studies are conducted in different school districts with different populations and levels of resources, the “average” across these studies starts to make sense. Importantly, the average across these 10 studies for each of the subgroups will give a valid picture of where, how, and with which students and teachers the program tends to work best. This kind of averaging used in research is called meta-analysis and allows many small differences found across studies to build on the power of each study to generate reliable findings.
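As a rough sketch of what that averaging can look like, the snippet below pools several small studies with inverse-variance (fixed-effect) weighting, which is one common meta-analytic choice and not necessarily the one a given review would use. The subgroup names, effect sizes, and standard errors are invented for illustration.

```python
import math


def pooled_effect(studies):
    """Fixed-effect (inverse-variance) pooling of (effect, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    estimate = sum(w * eff for w, (eff, _) in zip(weights, studies)) / total
    # The pooled standard error shrinks as more studies are added.
    return estimate, math.sqrt(1.0 / total)


# Hypothetical effect sizes and standard errors from three small studies,
# reported separately for two subgroups.
studies_by_subgroup = {
    "subgroup_a": [(0.10, 0.08), (0.14, 0.10), (0.06, 0.09)],
    "subgroup_b": [(-0.02, 0.09), (0.03, 0.11), (-0.05, 0.10)],
}

for name, studies in studies_by_subgroup.items():
    est, se = pooled_effect(studies)
    print(f"{name}: pooled effect {est:.3f} (SE {se:.3f})")
```

The pooled standard error is smaller than that of any single study, which is how many small differences can add up to a reliable finding.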

If developers or publishers of the products being used in schools took advantage of their hundreds of implementations to gather data, and if schools would be prepared to share student data for this research, we could have research findings that both help schools decide what will likely work for them and help developers improve their products.

2018-09-21

Which Came First: The Journal or the Conference?

You may have heard of APPAM, but do you really know what they do? They organize an annual conference? They publish a journal? Yes, they do all that and more!

APPAM stands for the Association for Public Policy Analysis and Management. APPAM is dedicated to improving public policy and management by fostering excellence in research, analysis, and education. The first APPAM Fall Research Conference occurred in 1979 in Chicago. The first issue of the Journal of Policy Analysis and Management appeared in 1981.

Why are we talking about APPAM now? While we’ve attended the APPAM conference multiple years in the past, the upcoming conference poses a unique opportunity for us. This year, our chief scientist, Andrew Jaciw, is acting as guest editor of a special issue of Evaluation Review on multi-armed randomized experiments. As part of this effort, to encourage discussion of the topic, he proposed three panels that were accepted at APPAM.

Andrew will chair the first panel titled Information Benefits and Statistical Challenges of Complex Multi-Armed Trials: Innovative Designs for Nuanced Questions.

In the second panel, Andrew will be presenting a paper that he co-wrote with Senior Research Manager Thanh Nguyen titled Using Multi-Armed Experiments to Test “Improvement Versions” of Programs: When Beneficence Matters. This presentation will take place on Friday, November 9, 2018 at 9:30am (in Marriott Wardman Park, Marriott Balcony B - Mezz Level).

In the third panel he submitted, Larry Orr, Joe Newhouse, and Judith Gueron (with Becca Maynard as discussant) should provide an important retrospective. As pioneers of social science experiments, the contributors will share experiences and important lessons learned.

Some of these panelists will also be submitting their papers to the special edition of the Evaluation Review. We will update this blog with a link to that journal issue once it has been published.

2018-08-21

New Multi-State RCT with Imagine Learning

Empirical Education is excited to announce a new study on the effectiveness of Imagine Math, an online supplemental math program that helps students build conceptual understanding, problem-solving skills, and a resilient attitude toward math. The program provides adaptive instruction so that students can work at their own pace and offers live support from certified math teachers as students work through the content. Imagine Math also includes diagnostic benchmarks that allow educators to track progress at the student, class, school, and district level.

The research questions to be answered by this study are:

  1. What is the impact of Imagine Math on student achievement in mathematics in grades 6–8?
  2. Is the impact of Imagine Math different for students with diverse characteristics, such as those starting with weak or strong content-area skills?
  3. Are differences in the extent of use of Imagine Math, such as the number of lessons completed, associated with differences in student outcomes?

The new study will use a randomized control trial (RCT) or randomized experiment in which two equivalent groups of students are formed through random assignment. The experiment will specifically use a within-teacher RCT design, with randomization taking place at the classroom level for eligible math classes in grades 6–8.

Eligible classes will be randomly assigned to either use or not use Imagine Math during the school year, with academic achievement compared at the end of the year, in order to determine the impact of the program on grade 6-8 mathematics achievement. In addition, Empirical Education will make use of Imagine Math’s usage data for potential analysis of the program’s impact on different subgroups of users.
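For readers unfamiliar with the design, a within-teacher assignment can be sketched roughly as follows. This is an illustration and not the study’s actual randomization procedure: the teacher and class identifiers, the seed, and the rule that an odd class goes to the control condition are all assumptions.

```python
import random
from collections import defaultdict


def assign_within_teacher(classes, seed=2018):
    """Randomly split each teacher's classes between 'imagine_math' and 'control'.

    classes: list of (teacher_id, class_id) pairs (hypothetical identifiers).
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    by_teacher = defaultdict(list)
    for teacher_id, class_id in classes:
        by_teacher[teacher_id].append(class_id)

    assignment = {}
    for teacher_id in sorted(by_teacher):
        class_ids = sorted(by_teacher[teacher_id])
        rng.shuffle(class_ids)
        half = len(class_ids) // 2
        # First half of the shuffled list gets the program; the rest are controls.
        for i, class_id in enumerate(class_ids):
            assignment[class_id] = "imagine_math" if i < half else "control"
    return assignment


eligible = [("t1", "c1"), ("t1", "c2"), ("t2", "c3"), ("t2", "c4")]
print(assign_within_teacher(eligible))
```

Because each teacher contributes classes to both conditions, differences between teachers are balanced across the two groups by design.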

This is Empirical Education’s first project with Imagine Learning, highlighting our extensive experience conducting large-scale, rigorous, experimental impact studies. The study is commissioned by Imagine Learning and will take place in multiple school districts and states across the country, including Hawaii, Alabama, Alaska, and Delaware.

2018-08-03

For Quasi-experiments on the Efficacy of Edtech Products, it is a Good Idea to Use Usage Data to Identify Who the Users Are

With edtech products, the usage data allows for precise measures of exposure and whether critical elements of the product were implemented. Providers often specify an amount of exposure or the kind of usage that is required to make a difference. Furthermore, educators often want to know whether the program has an effect when implemented as intended. Researchers can readily use data generated by the product (usage metrics) to identify compliant users, or to measure the kind and amount of implementation.

Since researchers generally track product implementation and statistical methods allow for adjustments for implementation differences, it is possible to estimate the impact on successful implementers, or technically, on a subset of study participants who were compliant with treatment. It is, however, very important that the criteria researchers use in setting a threshold be grounded in a model of how the program works. This will, for example, point to critical components that can be referred to in specifying compliance. Without a clear rationale for the threshold set in advance, the researcher may appear to be “fishing” for the amount of usage that produces an effect.
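A minimal sketch of what setting the threshold in advance might look like in practice is below. The 20-lesson cutoff, the record fields, and the student IDs are hypothetical; in a real study the cutoff would come from the provider’s model of how the program works and be written down before outcomes are examined.

```python
from dataclasses import dataclass

# Compliance threshold fixed before any outcome data are examined.
# The value 20 is a made-up example, not a recommendation.
MIN_LESSONS_COMPLETED = 20


@dataclass
class UsageRecord:
    student_id: str
    lessons_completed: int


def compliant_users(records, threshold=MIN_LESSONS_COMPLETED):
    """Return IDs of students whose usage meets the pre-specified threshold."""
    return [r.student_id for r in records if r.lessons_completed >= threshold]


usage = [
    UsageRecord("s1", 25),
    UsageRecord("s2", 4),
    UsageRecord("s3", 20),
]
print(compliant_users(usage))
```

Keeping the threshold as a named constant, separate from the outcome analysis, makes it easy to show that it was not tuned after the fact.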

Some researchers reject comparison studies in which identification of the treatment group occurs after the product implementation has begun. This is based in part on the concern that the subset of users who comply with the suggested amount of usage will get more exposure to the program. More exposure will result in a larger effect. This assumes, of course, that the product is effective; otherwise the students and teachers will have been wasting their time and will likely perform worse than the comparison group.

There is also the concern that the “compliers” may differ from the non-compliers (and non-users) in some characteristic that isn’t measured, and that even after controlling for measurable variables (prior achievement, ethnicity, English proficiency, etc.), there could be a personal characteristic that results in an otherwise ineffective program becoming effective for them. We reject this concern and take the position that a product’s effectiveness can be strengthened or weakened by many factors. A researcher conducting any matched comparison study can never be certain that there isn’t an unmeasured variable that is biasing it. (That’s why the What Works Clearinghouse only accepts Quasi-Experiments “with reservations.”) However, we believe that as long as the QE controls for the major factors that are known to affect outcomes, the study can meet the Every Student Succeeds Act requirement that the researcher “controls for selection bias.”

With those caveats, we believe that a QE that identifies users by their compliance to a pre-specified level of usage is a good design. Studies that look at the measurable variables that modify the effectiveness of a product can not only be useful for schools in answering their question, “is the product likely to work in my school?” but can also point the developer and product marketer to ways the product can be improved.

2018-07-27