blog posts and news stories

Doing Something Truly Original in the Music of Program Evaluation

Is it possible to do something truly original in science?

How about in Quant evaluations in the social sciences?

The operative word here is "truly". I have in mind contributions that are "outside the box".

I would argue that standard Quant provides limited opportunity for originality. Yet, QuantCrit forces us to dig deep to arrive at original solutions - to reinterpret, reconfigure, and in some cases reinvent Quant approaches.

That is, I contend that QuantCrit asks the kinds of questions that force us to go outside the box of conventional assumptions and develop instrumentation and solutions that are broader and better. Yet, I qualify this by saying (and some will disagree) that doing so does not require us to give up the core assumptions at the foundation of Quant evaluation methods.

I find that developments and originality in jazz music closely parallel what I have in mind in discussing the evolution of genres in Quant evaluations, and what it means to conceive of and address problems and opportunities outside the box. (You can skip this section, and go straight to the final thoughts, but I would love to share my ideas with you here.)

An Analogy for Originality in the Artistry of Herbie Hancock

Last week I took my daughter, Maya, to see the legendary keyboardist Herbie Hancock perform live with Lionel Loueke, Terence Blanchard, and others. CHILLS down my spine is how I would describe it. I found myself fixating on Hancock's hand movements on the keys, and on how he swiveled between the grand piano and the KORG synthesizer, and asking: "the improvisation is on-point all the time – how does he know to go right there?"

Hancock, winner of an Academy Award and 14 Grammys, is a (if not the) major force in the evolution of jazz over the last 60 years, up to the contemporary scene.

He got his main start in the 1960s as the pianist in Miles Davis's Second Great Quintet. (When Hancock was dispirited, Davis famously advised him, "don't play the butter notes.") Check out this performance by the band of Wayne Shorter's composition "Footprints" from 1967 – note the symbiosis among the group and Hancock's respectful treatment of the melody.

In the 1970s Hancock developed styles of jazz fusion and funk with the Headhunters (e.g., "Chameleon").

Then in the 1980s Hancock explored electro styles, capped by the song "Rockit" – a smash that straddled jazz, pop, and hip-hop. It featured turntable scratching and became a mainstay for breakdancing (in upper elementary school I co-created a truly amateurish school play that ended in an ensemble "Rockit" dance with the best breakdancers in our school). Here's Hancock's Grammy performance.

Below is a picture of Hancock from the other night with the strapped-on synth popularized through the song "Rockit."

Hancock and Synth

Hancock did plenty more besides what I mention here, but I narrowed his contributions to just a couple to help me make my point.

His direction, especially with funk fusion and "Rockit," ruffled the feathers of more than a few jazz purists. He did not mind. His response was: "I have to be true to myself… it was something that I needed to do… because it takes courage to work outside the box… and yet, that's where the growth lies."

He also recognized that the need for progression was not just to satisfy his creative direction, but to keep the audience listening – that is, for jazz to stay alive and relevant. If someone asserts that "Rockit" was a betrayal of jazz that sacrilegiously crossed over into pop and hip-hop, I would counter that it opened up the world of jazz to a whole generation of pop listeners (including me). (I recognize similar developments in the recent genre-crossing works of Robert Glasper.)

Hancock is a perfect case study of an artist executing his craft (a) fearlessly, (b) not with the goal of pleasing everyone, (c) with the purpose of connecting with, and reaching, new audiences, (d) by being open to alternative influences, (e) to achieve a harmonious melodic fusion (moving between his KORG synth and a grand piano), and (f) with constant appreciation of, and reflection on, the roots and fundamentals.

Hancock and Band

Coming Back to the Idea of the Fusion of Quant with QuantCrit in Program Evaluation

Society today presents us with situations that require critical examination of how we use the instruments on which we are trained, and an audit of the effect they have, both intended and unintended. It also requires that we adapt the applications of methods that we have honed for years. The contemporary situation poses the question: How can we expand the range of what we can do with the instruments on which we are trained, given the solutions that society needs today, recognizing that any application has social ramifications? I have in mind the need to prioritize problems of equity and social and racial justice. How do we look past conventional applications that limit the recognition, articulation, and development of solutions to important and vexing problems in society?

Rather than feeling powerless and overwhelmed, the Quant evaluator is very well positioned to do this work. I greatly appreciate the observation by Frances Stage on this point:

"…as quantitative researchers we are uniquely able to find those contradictions and negative assumptions that exist in quantitative research frames."

This is analogous to saying that a dedicated pianist in classic jazz is very well positioned to expand the progressions and reach harmonies that reflect contemporary opportunities, needs, and interests. It may also require the Quant evaluator to expand their arrangements and instrumentation.

As Quant researchers and evaluators, we are most familiar with the "rules of playing" that reinforce "the same old song" that needs questioning. QuantCrit can give us the momentum to push the limits of our instruments and apply them in new ways.

In making these points I feel a welcome alignment with Hancock's approach: recognizing the need to break free from cliché and convention, to keep meaningful discussion going, to maximize relevance, to get to the core of evaluation purpose, to reach new audiences and seed/facilitate new collaborations.

Over the next year I'll be posting a few creations, and striking in some new directions, with syncopations and chords that try to maneuver around and through the orthodoxy – "switching up" between the "KORG and the baby grand" so to speak.

Please stay tuned.

The Band on Stage

2024-10-15

SREE 2024: On a Mission to Deepen my Quant and Equity Perspectives

I am about to get on the plane to SREE

I am excited, but also somewhat nervous.

Why?


I'm excited to immerse myself in the conference – my goal is to try to straddle paradigms of criticality and the quant tradition. SREE historically has championed empirical findings using rigorous statistical methods.

I'm excited because I will be discussing intersectionality – a topic of interest that emerged from attending a series of Critical Perspectives webinars hosted by SREE over the last few years. I want to repay that by moving the conversation forward and contributing to the critical discussion.

I'm nervous because the topic of intersectionality is new for me. The idea cuts across many areas – law, sociology, epidemiology, education – and spans several literature streams. It also gets at social justice issues that I am not used to talking about, and I want to express them clearly and accurately. I understand the power and privilege of my words and presentation, and I want the audience to continue to inquire and move the conversation forward.

I'm nervous because issues of quantitative criticality require a person to confront their deeper philosophical commitments, assumptions, and theory of knowledge (epistemology). I have no problem with that; however, a few of my experimentalist colleagues have expressed a deep resistance to philosophy. One described it as merely a "throat clearing exercise." (I wonder: Will those with a positivist bent leave my talk in droves?)

Andrew staring at clock

What is intersectionality anyway, and why was I attracted to the idea? It originates in the legal scholarship of Kimberlé Crenshaw. She describes a court case filed against General Motors (GM):

"In DeGraffenreid, the court refused to recognize the possibility of compound discrimination against Black women and analyzed their claim using the employment of white women as the historical base. As a consequence, the employment experiences of white women obscured the distinct discrimination that Black women experienced."
The court's refusal to "acknowledge that Black women encounter combined race and sex discrimination implies that the boundaries of sex and race discrimination doctrine are defined respectively by white women's and Black men's experiences."

The court refused to recognize that hiring practices at GM compounded discrimination at specific intersections of socially recognized categories (i.e., Black women). The issue is obvious but can be made concrete with an example. Imagine the following distribution of equally qualified candidates. The court's judgment would not have recognized this situation of compound discrimination:

graphic of gender and race
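Since the graphic is not reproduced here, a minimal sketch with purely hypothetical counts (not figures from the actual DeGraffenreid case) can stand in for it. Under the doctrine Crenshaw critiques, the sex-discrimination claim is benchmarked on white women's experience and the race-discrimination claim on Black men's experience, so both claims fail even though Black women are hired at a rate of zero:

```python
# Hypothetical counts of equally qualified applicants (illustration only;
# these are NOT figures from DeGraffenreid v. General Motors).
applied = {
    ("Black", "woman"): 20, ("Black", "man"): 20,
    ("white", "woman"): 20, ("white", "man"): 20,
}
hired = {
    ("Black", "woman"): 0,  ("Black", "man"): 20,
    ("white", "woman"): 20, ("white", "man"): 20,
}

# Hiring rate within each intersectional cell.
rates = {k: hired[k] / applied[k] for k in applied}

# Sex-discrimination claim, benchmarked on white women's experience:
print(rates[("white", "woman")])   # 1.0 -> "no sex discrimination"
# Race-discrimination claim, benchmarked on Black men's experience:
print(rates[("Black", "man")])     # 1.0 -> "no race discrimination"
# The intersectional cell the court declined to examine:
print(rates[("Black", "woman")])   # 0.0 -> compound discrimination
```

Each marginal comparison looks clean on its own; only the intersectional cell reveals the harm.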

Why did intersectionality pique my interest in the first place? Over the course of the SREE Critical Perspectives seminars, it occurred to me that intersectionality was a concept that bridged what I know with what I want to know.

I like representing problems and opportunities in education in quantitative terms. I use models. However, I also prioritize understanding the limits of our models, with reality serving as the ultimate check on the validity of the representation. Intersectionality, as a concept, pits our standard models against a reality that is both complex and socially urgent.

Intersectionality as a bridge:

graphic on intersectionality

Intersectionality presents an opportunity to reconcile two worlds, which is a welcome puzzle to work on.

picture of a puzzle

Here’s how I organized my talk. (See the postscript for how it went.)

  1. My positionality: I discussed my background "where I am coming from": including that most of my training is in quant methods, that I am interested in problems of causal generalizability, that I don’t shy away from philosophy, and that my children are racialized as mixed-race and their status inspired my first hypothetical example.
  2. I summarized intersectionality as originally conceived. I reviewed the idea as it was developed by Crenshaw.
  3. I reviewed some of the developments in intersectionality among quantitative researchers who describe their work and approaches as "quantitative intersectionality".
  4. I explored an extension of the idea of intersectionality through the concept of "unique to group" variables: I argued for the need to diversify our models of outcomes and impacts to take into account moderators of impact that are relevant to only specific groups and that respect the uniqueness of their experiences. (I will discuss this more in another blog that is soon to come.)
  5. I provided two examples, one hypothetical and one real, that clarified what I mean by the role of "unique to group" variables.
  6. I summarized the lessons.
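To make point 4 above a bit more concrete, here is a minimal sketch of a "unique to group" moderator, with invented data and labels (this is my illustration, not the analysis from the talk). The moderator is defined and measured only for group "A" (it is None elsewhere), so the treatment effect for A can be conditioned on it while a pooled estimate cannot see that variation:

```python
# Each record: (group, treated, unique-to-group moderator, outcome).
# The moderator is meaningful only for group "A"; it is None for group "B".
data = [
    ("A", 1, "high", 12), ("A", 0, "high", 6),
    ("A", 1, "low",  7),  ("A", 0, "low",  6),
    ("B", 1, None,   9),  ("B", 0, None,   7),
]

def effect(rows):
    """Mean outcome difference, treated minus control."""
    t = [y for _, tr, _, y in rows if tr == 1]
    c = [y for _, tr, _, y in rows if tr == 0]
    return sum(t) / len(t) - sum(c) / len(c)

pooled = effect(data)                                               # ~3.0
a_high = effect([r for r in data if r[0] == "A" and r[2] == "high"])  # 6.0
a_low  = effect([r for r in data if r[0] == "A" and r[2] == "low"])   # 1.0
```

A model-based version would interact treatment with the moderator within group A only; the point is simply that a pooled effect of roughly 3 masks within-group effects of 6 and 1 that only the group-specific moderator can surface.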

picture of a streetlight

There were some other exceptional talks that I attended at SREE, including:

  1. Promoting and Defending Critical Work: Navigating Professional Challenges and Perceptions
  2. Equity by Design: Integrating Criticality Principles in Special Education Research
  3. An excellent Hedges Lecture by Neil A. Lewis "Sharing What we Know (and What Isn’t So) to Improve Equity in Education"
  4. Design and Analysis for Causal Inferences in and across Studies

Postscript: How it went!

The other three talks in the session in which I presented (Unpacking Heterogeneous Effects: Methodological Innovations in Educational Research) were excellent. They included work by Peter Halpin on a topic that has puzzled me for a while: how item-level information can be leveraged to assess program impacts. We almost always assess impacts on scale scores from "ready-made" tests that are based on calibrations of item-level scores. An experiment effectively introduces variance into the testing situation, and I have wondered what it means for impacts to register at the item level, because each item may interact differently with the treatment. So "hats off" for linking psychometrics and construct validity to the discussion of impacts.

As for my presentation, I was deeply moved by the sentiments expressed by several conference-goers who came up to me afterwards. One comment was "you are on the right track." Others voiced an appreciation for my addressing the topic. I did feel THE BRIDGING between paradigms that I had hoped to at least set in motion. This was especially true when one of the other presenters in the session, who had addressed effect heterogeneity across studies, commented: "Wow, you're talking about some of the very same things that I am thinking." It felt good to know that this convergence happened even though, at the surface level, the two talks could be seen as very different. (And no, people did not leave in droves.)

Thank you Baltimore! I feel more motivated than ever. Thank you SREE organizers and participants.

Picture of Baltimore.

Treating myself afterwards…

Picture of a dessert case

A special shoutout to Jose Blackorby. In the end, I did hang up my tie. But I haven’t given up on the idea – just need to find one from a hot pink or aqua blue palette.

Andrew standing by the sree banner

2024-10-04

Considerations for Conducting Research in Digital Learning Platforms

Along with Digital Promise and members of the initial SEERNet research teams, we recently authored a paper illustrating some of the mindset shifts necessary when conducting research using digital learning platforms.

Researchers with traditional backgrounds may need to think flexibly about how to frame their research questions, collaborate closely with developers, identify log data that can inform implementation, and consider iterative study designs.

The paper builds on prior publications and discussions within SEERNet and the broader DLP-as-research-infrastructure movement. Visit SEERNet.org for more information.

Read the full paper here

2024-07-01

AERA 2024 Annual Meeting

We had an inspiring trip to Philadelphia last month! The AERA conference theme was Dismantling Racial Injustice and Constructing Educational Possibilities: A Call to Action. We presented our latest research on the CREATE study, spent time with our CREATE partners, and attended several captivating sessions on topics including intersectionality, QuantCrit methodology, survey development, race-focused survey research, and SEL. We came away from the conference energized, eager to apply this new learning to our current studies, and already looking forward to AERA 2025!

Thursday, April 11, 2024

Kimberlé Crenshaw 2024 AERA Annual Meeting Opening Plenary—Fighting Back to Move Forward: Defending the Freedom to Learn In the War Against Woke

Kimberle Crenshaw stands on stage delivering the opening plenary. Attendees fill the chairs in a large room, and some attendees sit on the floor.

Kimberlé Crenshaw’s opening plenary explored the relationship between our education system and our democracy, including censorship issues and what Crenshaw describes as a “violently politicized nostalgia for the past.” She brought in her own personal experience in recent years as she has witnessed terms that she coined, including “intersectionality,” being weaponized. She encouraged AERA attendees to fight against censorship in our institutions, and suggested that attendees check out the African American Policy Forum (AAPF) and the Freedom to Learn Network. To learn more, check out Intersectionality Matters!, an AAPF podcast hosted by Kimberlé Crenshaw.

Friday, April 12, 2024

Reconciling Traditional Quantitative Methods With the Imperative for Equitable, Critical, and Ethical Research

Five panelists sit on stage with a projector screen to their right. The heading on the projector screen reads Dialogue with Parents. Eleven attendees are pictured in the audience.

We were particularly excited to attend a panel on Reconciling Traditional Quantitative Methods With the Imperative for Equitable, Critical, and Ethical Research, as our team has been diving into the QuantCrit literature and interrogating our own quantitative methodology in our evaluations. The panelists embrace quantitative research, but emphasize that numbers are not neutral, and that the choices that quantitative researchers make in their research design are critical to conducting equitable research.

Nichole M. Garcia (Rutgers University) discussed her book project on intersectionality. Nancy López (University of New Mexico) encouraged researchers to consider additional questions about "street race," such as "What race do you think others assume you are?", to better understand the role that the social construction of race plays in participants' experiences. Jennifer Randall (University of Michigan) encouraged researchers to administer justice-oriented assessments, emphasizing that assessments are not objective, but rather subjective tools that reflect what we value and have historically contributed to educational inequalities. Yasmiyn Irizarry (University of Texas at Austin) encouraged researchers to do the work of citing QuantCrit literature when reporting quantitative research. (Check out #QuantCritSyllabus for resources compiled by Yasmiyn Irizarry and other QuantCrit scholars.)

This panel gave us food for thought, and pushed us to think through our own evaluation practices. As we look forward to AERA 2025, we hope to engage in conversations with evaluators on specific questions that come up in evaluation research, such as how to put WWC standards into conversation with QuantCrit methodology.

The Impact of the CREATE Residency Program on Early Career Teachers’ Well-Being

The Empirical Education team who presented at AERA in 2024.

Andrew Jaciw, Mayah Waltower, and Lindsay Maurer presented on The Impact of the CREATE Residency Program on Early Career Teachers’ Well-Being, focusing on our evaluation of the CREATE program. The CREATE Program at Georgia State University is a federally and philanthropically funded project that trains and supports educators across their career trajectory. In partnership with Atlanta Public Schools, CREATE includes a three-year residency model for prospective and early career teachers who are committed to reimagining classroom spaces for deep joy, liberation and flourishing.

CREATE has been awarded several grants from the U.S. Department of Education, in partnership with Empirical Education as the independent evaluators. The grants include those from Investing in Innovation (i3), Education Innovation and Research (EIR), and Supporting Effective Educator Development (SEED). CREATE is currently recruiting the 10th cohort of residents.

During our presentation, we looked back on promising results from CREATE's initial program model (2015–2019), shared recent results suggesting possible explanatory links between mediators and outcomes (2021–22), and discussed CREATE's evolving program model and how to identify and align more relevant measures (2022–current).

The following are questions that we continue to ponder.

  • What additional considerations should we take into account when thinking about measuring the well-being of Black educators?
  • Certain measures of well-being, such as the Maslach Burnout Inventory for Educators, reflect a narrower definition of teacher well-being. Are there measures of teacher well-being that reflect the context of the school that teachers are in and/or that are more responsive to different educational contexts?
  • Are there culturally-responsive measures of teacher well-being?
  • How can we measure the impacts of concepts relating to racial and social justice in the current political context?

Please reach out to us if you have any resources to share!

Survey Development in Education: Using Surveys With Students and Parents

Much of what I do as a Research Assistant at Empirical Education is to support the design and development of surveys, so I was excited to have the chance to attend this session! The authors’ presentations were all incredibly informative, but there were three in particular that I found especially relevant. The first was a paper presented by Jiusheng Zhu (Beijing Normal University) that analyzed the impact of “information nudges” on students’ academic achievement. This paper demonstrated how personalized, specific information nudges about short-term impacts can encourage students to modify their behavior.

Jin Liu (University of South Carolina) presented a paper on the development and validation of an ultra-short survey scale aimed at assessing the quality of life for children with autism. Through the use of network analysis and strength centrality estimations, the scale, known as Quality of Life for Children with Autism Spectrum Disorder (QOLASD-C3), was condensed to a much shorter version that targets specific dimensions of interest. I found this topic particularly interesting, as we are always in the process of refining our survey development processes. Finding ways to boost response rates and minimize participant fatigue is crucial in ensuring the effectiveness of research efforts.

In the third paper, Jennifer Rotach and Davie Store (Kent ISD) demonstrated how demographics play a role in how students score on assessments, and explained why disaggregating the data is sometimes necessary to ensure that all students' voices are heard. In many cases, school and district decisions are driven by average scores, which can exclude those who are above or below the average. Disaggregating survey data by demographics (such as race, gender, or disability status) can uncover a different story than the "average" alone will tell.
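As a toy illustration of that point (the scores and group labels below are invented, not from the authors' data), an overall average can look acceptable while disaggregation reveals a sizable gap:

```python
# Hypothetical assessment scores for two demographic groups.
scores = {
    "group_1": [82, 85, 88, 90],
    "group_2": [60, 62, 65, 68],
}

# The school-wide average pools everyone together...
all_scores = [s for group in scores.values() for s in group]
overall_avg = sum(all_scores) / len(all_scores)              # 75.0

# ...while disaggregating by group surfaces a 22.5-point gap.
by_group = {g: sum(v) / len(v) for g, v in scores.items()}   # 86.25 vs 63.75

print(overall_avg, by_group)
```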

— Mayah

Sunday, April 14, 2024

Conducting Race-Focused Survey Research in the P-20 System during the Anti-Woke Political Revolt

A presentation slide titled Researcher Positionality Conceptual Framework shows an image of a brain, with thought bubbles that say Researching the Self, Researching the Self in Relation to Others, Engaged Reflection and Representation, and Shifting from the Self to the System

The four presentations in the symposium titled Conducting Race-Focused Survey Research in the P–20 System During the Anti-Woke Political Revolt focused on tensions, challenges, and problem-solving throughout the process of developing the Knowledge, Beliefs, and Mindsets (KBMs) about Equity in Educators and Educational Leadership Survey. On the CREATE project, where we are constantly working to improve our surveys and center racial equity in our work, we are wrestling with similar dilemmas in terms of sociopolitical context. Therefore, it was very eye-opening to hear panelists talk through their decision-making throughout the entire survey development process. The North Carolina State We-LEED research team walked through their process step by step: from conceptualization, through the grounding literature and conceptual framing, instrument development, and cognitive interviews, to sample selection and recruitment strategies.

I particularly enjoyed hearing about cognitive interviews, where researchers asked participants to voice their inner monologue while taking the survey, so that they could understand participant feedback and be responsive to participant needs. It was also very helpful to hear the panelists reflect on their positionality and how their positionality connected to their research. I am eagerly anticipating reviewing this survey when it is finalized!

— Lindsay

Contemporary Approaches to Evaluating Universal School-Based Social Emotional Learning Programs: Effectiveness for Whom and How?

A screen projects a slide titled Contemporary Approaches to Evaluation SEL Programs. On the screen is a Venn diagram with three circles, labeled Skills-Based SEL, Adult Development SEL, and Justice Focused SEL. At the intersection of these three circles are bullet points with the words competencies, pedagogies, implementation, and outcomes.

I was excited to attend a session focused on Social Emotional Learning (SEL), a topic that directly relates to the projects I am currently involved in. The symposium featured four papers that all highlighted the importance of conducting high-quality evaluations of Universal School-Based (USB) SEL initiatives.

In the first paper, Christina Cipriano (Yale University) presented a meta-analysis of studies focusing on SEL. This meta-analysis demonstrated that, among the studies reviewed, SEL programs delivered by teachers showed greater improvements in SEL skills. This paper also provided evidence that programs that taught intrapersonal skills before teaching interpersonal skills showed greater effectiveness.

The second paper was presented by Melissa Lucas (Yale University) and underscored the necessity of including multilingual students in USB SEL evaluations, emphasizing the importance of considering these students when designing and implementing interventions.

Cheyeon Ha (Yale University) presented recommendations from the third paper, which underscored this point for me. The third paper was a meta-analysis of USB SEL studies in the U.S., and it showed that fewer than 15% of the studies it reviewed included students' English Language Learner (ELL) status. Because students with different primary languages may respond to SEL interventions differently, understanding how these programs work for students based on ELL status is important for better understanding an SEL program.

The final paper (presented by Christina Cipriano) provided methodological guidance, which I found particularly intriguing and thought-provoking. It highlighted the importance of utilizing mixed methods research, advocating for open data practices, and ensuring data accessibility and transparency for a wide range of stakeholders.

As we continue to work on projects aimed at implementing SEL and enhancing students’ social-emotional skills, the insights shared in this symposium will undoubtedly prove valuable in our efforts to conduct high-quality evaluations of SEL programs.

— Mayah

2024-05-30

Looking Back to Move Forward

We recently published a paper in collaboration with Digital Promise illustrating the historical precedents for the five digital learning platforms that make up SEERNet. In “Looking Back to Move Forward,” we trace the technical and organizational foundations of the network’s current efforts along four main themes.

By situating this innovative movement alongside its predecessors, we can identify the opportunities for SEERNet and others to progress and sustain the mission of making research more scalable, equitable, and rigorous.

Read the paper here.

2024-03-27

EIR 2023 Proposals Have Been Reviewed and Awards Granted

While children everywhere are excited about winter break and presents in their stockings, some of us in the education space look forward to December for other reasons. That’s right, the Department of Education just announced the EIR grant winners from the summer 2023 proposal submissions. We want to congratulate all our friends who were among the winners.

One of those winning teams was made up of The MLK Sr Community Resources Center, Connect with Kids Network, Morehouse and Spelman Colleges, New York City Public Schools, The Urban Assembly, Atlanta Public Schools, and Empirical Education. We will evaluate the Sankofa Chronicles: SEL Curriculum from American Diasporas with the early-phase EIR development grant funding.

The word sankofa comes from the Twi language spoken by the Akan people of Ghana. The word is often associated with an Akan proverb, “Se wo were fi na wosankofa a yenkyi.” Translated into English, this proverb reminds us, “It is not wrong to go back for that which you have forgotten.” Guided by the philosophy of sankofa, this five-year grant will support the creation of a culturally-responsive, multimedia, social emotional learning (SEL) curriculum for high school students.

Participating students will be introduced to SEL concepts through short films that tell emotional and compelling stories of diverse diaspora within students’ local communities. These stories will be paired with an SEL curriculum that seeks to foster not only SEL skills (e.g., self-awareness, responsible decision making) but also empathy, cultural appreciation, and critical thinking.

Our part in the project will begin with a randomized controlled trial (RCT) of the curriculum in the 2025–2026 school year and culminate in an impact report following the RCT. We will continue to support the program through the remainder of the five-year grant with an implementation study and a focus on scaling up the program.

Check back for updates on this exciting project!

2023-12-07

New Research Project Evaluating the Impact of FRACTAL

Empirical Education will partner with WestEd, Katabasis, and several school districts across North Carolina to evaluate the early-phase EIR development project Furthering Rural Adoption of Computers and Technology through Artistic Lessons (FRACTAL). This five-year grant will support the development and implementation of curriculum materials and professional development aimed at increasing computer self-efficacy and interest in STEAM careers among underserved rural middle school students in NC.

Participating students will build and keep their own computers and engage with topics like AI art. WestEd and Katabasis will work with teachers to co-design and pilot multiple expeditions that engage students in CS through their art and technology classes, culminating in an impact study in the final year (the 2026-27 school year).

Stay tuned for updates on results from the implementation study, as well as progress with the impact study.

Circuit board photo by Malachi Brooks on Unsplash

2023-11-06

Revisiting The Relationship Between Internal and External Validity

The relationship between internal and external validity has been debated over the last few decades.

At the core of the debate is the question of whether causal validity comes before generalizability. To oversimplify this a bit, it is a question of whether knowing “what works” is logically prior to the question of what works “for whom and under what conditions.”

Some may consider the issue settled. I don’t count myself among them.

I think it is extremely important to revisit this question in the contemporary context, in which discussions are centering on issues of diversity of people and places, and the situatedness of programs and their effects.

In this blog I provide a new perspective on the issue, one that I hope rekindles the debate, and leads to productive new directions for research. (It builds on presentations at APPAM and SREE.)

I have organized the content into three degrees of depth. 1. For those interested in a perusal, I have addressed the main issues through a friendly dialogue presented below. 2. For those who want a deeper dive, I provide a video of a PowerPoint in which I take you through the steps of the argument. 3. The associated paper, Hold the Bets! Do Quasi- and True Experimental Evaluations Yield Equally Valid Impact Results When Effect Generalization is the Goal?, is currently posted as a preprint on SAGE Advance and is under review by a journal.

Lastly, I would really value your comments on any of these works, to keep the conversation going, along with the progress in, and benefits from, research. Enjoy (and I hope to hear from you!),

Andrew Jaciw

The Great Place In-Between for Researchers and Evaluators

The impact evaluator is at an interesting crossroads between research and evaluation. There is an accompanying tension, but one that provides fodder for new ideas.

The perception of doing research, especially generalizable scientific research, is that it contributes information about the order of things, and about the relations among parts of systems in nature and society, that leads to cumulative and lasting knowledge.

Program evaluation is not quite the same. It addresses immediate needs, seldom has the luxury of time, and is meant to provide direction for critical stakeholders. It is governed by the Program Evaluation Standards, of which Accuracy (including internal and statistical conclusion validity) is just one of many, with equal concern for Propriety and Stakeholder Representation.

The activities of the researcher and the evaluator may be seen as complementary, and the results of each can serve evaluative and scientific purposes.

The “impact evaluator” finds herself in a good place where the interests of the researcher-evaluator and evaluator-researcher overlap. This zone is a place where productive paradoxes emerge.

Here is an example from this zone. It takes the form of a friendly dialogue between an Evaluator-Researcher (ER) and a Researcher-Evaluator (RE).

ER: Having puzzled over the problem of external validity, I have proposed a novel method for answering the question of “what works”, or, more correctly, of “what may work” in my context. It assumes a program has not yet been tried at my site of interest (the inference sample), and it involves comparing performance across one or more sites where the program has been used to performance at my site. The goal is to infer the impact for my site.

RE: Hold on. So that’s kind of like a comparison group design, but in reverse. You’re starting with an untreated group and comparing it to a treated group to draw an inference about potential impact for the untreated group. Right?

ER: Yes.

RE: But that does not make sense. That’s not the usual starting point. In research we start with the treated group and look for a valid control, not the other way around. I am confused.

ER: I understand, but when I was teaching, such comparisons were natural. For example, we compared the performance of a school just like ours, but that used Success For All (SFA), to performance at our school, which did not use SFA, to infer how we might have performed had we used the program. That is, to generalize the potential effect of the program for our site.

RE: You mean to predict impact for your site.

ER: Call it what you will. I prefer generalize because I am using information about performance under assignment to treatment from somewhere else.

RE: Hmmm. Odd, but OK (for now). However, why would you do that? Why not use an experimental result from somewhere else, maybe with some adjustment for differences in student composition and other things? You know, using reweighting methods, to produce a reasonable inference about potential impact for your site.

ER: I could, but that information would be coming from somewhere else, where there are a lot of unknown variables about how that site operates, and I am not sure the local decision-makers would buy it. Coming from elsewhere, it would be considered less relevant.

RE: But your comparison also uses information from somewhere else. You’re using performance outcomes from somewhere else (where the treatment was implemented) to infer how your local site would have performed had the treatment been used there.

ER: Yes, but I am also preserving the true outcome in the absence of treatment (the ‘business as usual’ control outcome) for my site. I have half the true solution for my site. You’re asking me to get all my information from somewhere else.

RE: Yes, but I know the experimental result is unbiased from selection into conditions at the other “comparison” site, because of the randomized and uncompromised design. I’ll take that over your “flipped” comparison group design any day!

ER: But your result may be biased from selection into sites, reflecting imbalance on known and possibly unknown moderators of impact. You’re talking about an experiment over there, and I have half the true solution over here, where I need it.

RE: I’ll take internal validity over there, first, and then worry about external validity to your site. Remember, internal validity is the “sine qua non”. Without it, you don’t have anything. Your approach seems deficient on two counts: first from lack of internal validity (you’re not using an experiment), and second from a lack of external validity (you’re drawing a comparison with somewhere else).

ER: OK, now you’re getting to the meat of things. Here is my bold conjecture: yes, both internal validity bias and external validity bias may be at play, but sometimes they may cancel each other out.

RE: What!? Like a chancy fluky kind of thing?

ER: No, systematically, and in principle.

RE: I don’t believe it. Two wrongs (biases) don’t make a right.

ER: But the product of two negatives makes a positive.

RE: I need something concrete to show what you mean.

ER: OK, here is an instance… The left vertical bar is the average impact for my site (site N). The right vertical bar is the average impact for the remote site (site M). The short horizontal bars show the values of Y (the outcome) for each site. (The black ones show values we can observe, the white-filled one shows an unobserved value [i.e., I don’t observe performance at my site (N) when treatment is provided, so the bar is empty.]) Bias1 is the difference between the other site and my site in the average impact (the difference in length of the vertical bars). Bias2 results from a comparison between sites in their average performance in the absence of treatment.

A figure showing the difference between performance in the presence of treatment at one location, and performance in the absence of treatment at the other location, which is the inference site.

The point that matters here is that using the impact from the other site M (the length of the vertical line at M) to infer impact for my site N leads to a result that is biased by an amount equal to the difference in the lengths of the vertical bars (Bias1). But if I use the approach I am talking about, and compare performance under treatment at the remote site M (black bar at the top of Site M) to performance at my site without treatment (black bar at the bottom of Site N), the total bias is (Bias1 – Bias2), and the magnitude of this “net bias” is less than Bias1 by itself.
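To make the arithmetic concrete, here is a minimal numeric sketch. The values are hypothetical, chosen only for illustration (they are not from any study); the point is simply that the flipped comparison’s net bias equals Bias1 – Bias2 and can be smaller in magnitude than Bias1 alone.

```python
# Hypothetical true values, chosen only to illustrate the arithmetic.
impact_N = 5.0    # true average impact at my site N (not directly observable)
impact_M = 9.0    # true average impact at the remote site M
control_N = 50.0  # untreated ('business as usual') performance at site N
control_M = 48.0  # untreated performance at site M

treated_M = control_M + impact_M  # observed treated performance at site M

# Borrowing site M's impact wholesale is off by Bias1.
bias1 = impact_M - impact_N  # 4.0

# Bias2: the between-site difference in untreated performance.
bias2 = control_N - control_M  # 2.0

# The flipped comparison: treated outcome at M minus untreated outcome at N.
flipped_estimate = treated_M - control_N
net_bias = flipped_estimate - impact_N  # equals bias1 - bias2 = 2.0

assert net_bias == bias1 - bias2
assert abs(net_bias) < abs(bias1)  # the two biases partially cancel
```

With the signs flipped (e.g., if my site’s untreated performance were lower than site M’s), the two biases would instead compound, which is why the claim is hedged as “sometimes” rather than “always”.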

RE: Well, you have not figured in the sampling error.

ER: Correct. We can do that, but for now let’s consider that we’re working with true values.

RE: OK, let’s say for the moment I accept what you’re saying. What does it do to the order and logic that internal validity precedes external validity?

ER: That is the question. What does it do? It seems that when generalizability is a concern, internal and external validity should be considered concurrently. Internal validity is the sole concern only when external validity is not at issue. You might say internal validity wins the race, but only when it’s the only runner.

RE: You’re going down a philosophical wormhole. That can be dangerous.

ER: Alright, then let’s stop here (for now).

RE and ER walk happily down the conference hall to the bar where RE has a double Jack, neat, and ER tries the house red.

BTW, here is the full argument and mathematical demonstration of the idea. Please share on social and tag us (our social handles are in the footer below). We’d love to know your thoughts. A.J.

2023-09-20

Multi-Arm Parallel Group Design Explained

What do unconventional arm wrestling and randomized trials have in common?

Each can have many arms.

What is a 3 arm RCT?

Multi-arm trials (or multi-arm RCTs) are randomized experiments in which individuals are randomly assigned to one of several arms: usually two or more treatment variants and a control (two treatment variants plus a control make a 3-arm RCT).

They can be referred to in a number of ways.

  • multi-arm trials
  • multi-armed trials
  • multiarm trials
  • multiarmed trials
  • multi arm RCTs
  • 3-arm, 4-arm, 5-arm, etc RCTs
  • multi-factorial design (a type of multi-arm trial)

a figure illustrating a 2-arm trial with 2 arms with one labeled treatment and one labeled control

a figure illustrating a 3-arm trial with 3 arms with one labeled treatment 1, one labeled treatment 2, and one labeled control

When I think of a multiarmed wrestling match, I imagine a mess. Can’t you say the same about multiarmed trials?

Quite the contrary. They can become messy, but not if they’re done with forethought and consultation with stakeholders.

I had the great opportunity to be the guest editor of a special issue of Evaluation Review on the topic of Multiarmed Trials, where experts shared their knowledge.

Special Issue: Multi-armed Randomized Control Trials in Evaluation and Policy Analysis

We were fortunate to receive five valuable contributions. I hope the issue will serve as a go-to reference for evaluators who want to explore options beyond the standard two-armed (treatment-control) arrangement.

The first three articles are by pioneers of the method.

  • Larry L. Orr and Daniel Gubits: Some Lessons From 50 Years of Multi-armed Public Policy Experiments
  • Joseph Newhouse: The Design of the RAND Health Insurance Experiment: A Retrospective
  • Judith M. Gueron and Gayle Hamilton: Using Multi-Armed Designs to Test Operating Welfare-to-Work Programs

They cover a wealth of ideas essential for the successful conduct of multi-armed trials.

  • Motivations for study design and the choice of treatment variants, and their relationship to real-world policy interests
  • The importance of reflecting the complex ecology and political reality of the study context to get stakeholder buy-in and participation
  • The importance of patience and deliberation in selecting sites and samples
  • The allocation of participants to treatment arms with a view to statistical power

Should I read this special issue before starting my own multi-armed trial?

Absolutely! It’s easy to go wrong with this design, but done right, it can yield more information than a 2-armed trial. Sample allocation depends on the question you want to ask. In a 3-armed trial, do you want 33.3% of the sample in each of the three conditions (two treatment conditions and control), or 25% in each of the treatment arms and 50% in control? It depends on the contrast and research question, so the design requires you to think more deeply about which question you want to answer.
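As a rough sketch of why the allocation choice hinges on the contrast of interest, consider the standard error of a difference in means under the two allocations just mentioned. The total sample size and the equal-variance assumption here are purely illustrative.

```python
import math

def contrast_se(n_total, share_a, share_b, sigma=1.0):
    """SE of a difference in means between two arms, given each arm's
    share of the total sample (equal outcome variance assumed)."""
    n_a, n_b = n_total * share_a, n_total * share_b
    return sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)

N = 900  # hypothetical total sample

# Treatment-vs-control contrast under the two allocations.
se_tc_thirds = contrast_se(N, 1/3, 1/3)   # 300 vs 300
se_tc_heavy = contrast_se(N, 0.25, 0.50)  # 225 vs 450

# Treatment1-vs-treatment2 contrast under the same two allocations.
se_tt_thirds = contrast_se(N, 1/3, 1/3)   # 300 vs 300
se_tt_heavy = contrast_se(N, 0.25, 0.25)  # 225 vs 225

# In this setup the treatment-vs-control precision happens to coincide,
# but the head-to-head treatment contrast is noisier under 25/25/50.
print(math.isclose(se_tc_thirds, se_tc_heavy))  # True
print(se_tt_heavy > se_tt_thirds)               # True
```

So if the head-to-head comparison of the two treatments is the primary question, equal thirds serves you better; if each treatment is mainly compared against control, the heavier control arm costs nothing here. That is the sense in which the allocation decision forces you to name your question first.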

This sounds risky. Why would I ever want to run a multi-armed trial?

In short, running a multi-armed trial allows a head-to-head test of alternatives, to determine which provides a larger or more immediate return on investment. It also sets up nicely the question of whether certain alternatives work better with certain beneficiaries.

The next two articles make this clear. One study randomized treatment sites to one of several enhancements to assess the added value of each. The other used a nifty multifactorial design to simultaneously test several dimensions of a treatment.

  • Laura Peck, Hilary Bruck, and Nicole Constance: Insights From the Health Profession Opportunity Grant Program’s Three-Armed, Multi-Site Experiment for Policy Learning and Evaluation Practice
  • Randall Juras, Amy Gorman, and Jacob Alex Klerman: Using Behavioral Insights to Market a Workplace Safety Program: Evidence From a Multi-Armed Experiment

More About 3 Arm RCTs

The special issue of Evaluation Review helped motivate the design of a multiarmed trial conducted through the Regional Educational Laboratory (REL) Southwest in partnership with the Arkansas Department of Education (ADE). We co-authored this study through our role on REL Southwest.

In this study with ADE, we randomly assigned 700 Arkansas public elementary schools to one of eight conditions determining how communication was sent to their households about the Reading Initiative for Student Excellence (R.I.S.E.) state literacy website.

The treatments varied on these dimensions.

  1. Mode of communication (email only or email and text message)
  2. The presentation of information (no graphic or with a graphic)
  3. Type of sender (generic sender or known sender)

In January 2022, households with children in these schools were sent three rounds of communications with information about literacy and a link to the R.I.S.E. website. The study examined the impact of these communications on whether parents and guardians clicked the link to visit the website (click rate). We also conducted an exploratory analysis of differences in how long they spent on the website (time on page).

How do you tell the effects apart?

It all falls out nicely if you imagine the conditions as branches, or cells in a cube (both are pictured below).

In the branching representation, there are eight possible pathways from left to right representing the eight conditions.

In the cube representation, the eight conditions correspond to the eight distinct cells.

In the study, we evaluated the impact of each dimension across levels of the other dimensions: for example, whether click rate increases if email is accompanied with text, compared to just email, irrespective of who the sender is or whether the infographic is used.
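As a sketch of how the eight cells and a main-effect contrast line up, the following enumerates the 2×2×2 conditions. The dimension labels paraphrase the three dimensions from the study; the click rates are made-up placeholders for illustration, not study results.

```python
from itertools import product

# The three dimensions varied in the study (labels paraphrased).
modes = ["email_only", "email_and_text"]
graphics = ["no_graphic", "graphic"]
senders = ["generic_sender", "known_sender"]

# Eight conditions: every combination of the three dimensions.
conditions = list(product(modes, graphics, senders))

# Placeholder click rates (made up): email+text cells do a bit better.
click_rate = {c: 0.10 + (0.02 if c[0] == "email_and_text" else 0.0)
              for c in conditions}

def mean_over(cells):
    return sum(click_rate[c] for c in cells) / len(cells)

# Main effect of mode: average the four email+text cells and the four
# email-only cells, collapsing over graphic and sender, then difference.
with_text = [c for c in conditions if c[0] == "email_and_text"]
email_only = [c for c in conditions if c[0] == "email_only"]
mode_effect = mean_over(with_text) - mean_over(email_only)
```

Each main effect uses all eight cells, four per side of the contrast, which is how a factorial design extracts several answers from one sample.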

We also tested the impact on click rates of the “deluxe” version (email + text, with known sender and graphic, which is the green arrow path in the branch diagram [or the red dot cell in the cube diagram]) versus the “plain” version (email only, generic sender, and no graphic, which is the red arrow path in the branch diagram [or the green dot cell in the cube diagram]).

a figure illustrating the multi arms of the RCT and what intervention each of them received

a figure of a cube illustrating multi-armed trials

That’s all nice and dandy, but have you ever heard of the KISS principle: Keep it Simple Sweetie? You are taking some risks in design, but getting some more information. Is the tradeoff worth it? I’d rather run a series of two-armed trials. I am giving you a last chance to convince me.

Two-armed trials will always be the staple approach. But consider the following.

  • Knowing what works among educational interventions is a starting point, but it does not go far enough.
  • The last 5-10 years have witnessed the prioritization of questions, and of methods for addressing them, concerning what works for whom and under which conditions.
  • However, even this may not go far enough to get to the question at the heart of what people on the ground want to know. We agree with Tony Bryk that practitioners typically want to answer the following question.

What will it take to make it (the program) work for me, for my students, and in my circumstances?

There are plenty of qualitative, quantitative, and mixed methods to address this question. There also are many evaluation frameworks to support systematic inquiry to inform various stakeholders.

We think multi-armed trials help to tease out the complexity in the interactions among treatments and conditions and so help address the more refined question Bryk asks above.

Consider our example above. One question we explored was how response rates varied between rural and urban schools. One might speculate the following.

  • Rural schools are smaller, allowing principals to get to know parents more personally
  • Rural and non-rural households may have different kinds of usage and connectivity with email versus text and with MMS versus SMS

If these moderating effects matter, then the study, as conducted, may help with customizing communications, or providing a rationale for improving connectivity, and altogether optimizing the costs of communication.

Multi-armed trials, done well, increase the yield of actionable information to support both researcher and on-the-ground stakeholder interests!

Well, thank you for your time. I feel well-armed with information. I’ll keep thinking about this and wrestle with the pros and cons.

2023-05-31