The replicability issue and stereotype threat research

A recent Radiolab broadcast revisited the topic of stereotype threat (ST) in light of recent concerns about the replicability of findings in psychology and the sciences more generally. The purpose of this post is to address the replication issue in relation to stereotype threat research.

  1. Replicating ST’s effect on women and math

We start by noting that, for some time, we have been gathering resources needed for a large-scale “adversarial” replication of the early experiments testing ST’s effect on women’s math performance. This registered replication addresses the replicability of experiments by Spencer et al. done over 20 years ago. To our knowledge, there has been no attempt to replicate them that has conformed to the conditions stereotype theory specifies as necessary to produce the effect.

  2. The Situational Nature of ST

ST is a situational predicament: being in a situation in which you are strongly invested, where you know you could be negatively stereotyped based on one or more of your social identities—age, religion, gender, race, etc. It is not a person characteristic like, say, neuroticism, that travels with the person from situation to situation. It might make sense to ask “how big an effect does being neurotic have in people’s lives?” in the sense of how big an effect it has across the situations of a person’s life. But ST is not a person characteristic. It is a situational predicament or pressure that doesn’t exist when the elements of the predicament are not present, but that can have big, even life-shaping effects when those elements are present, especially when they are persistently present in a domain of life that is important to the person.

Thus, more meaningful questions are: is this predicament strong and important for women in elite STEM coursework, or for girls in 6th grade math, or for girls in Poland where gender math stereotypes are weak, or for whites interacting with blacks when the topic is race, or for whites interacting with blacks when the topic is sports, or for blacks at elite colleges, or for blacks at state colleges, and so on. In all of these examples, the question about ST is whether the predicament it constitutes is a significant pressure for the group being referenced in the situation. There is no doubt that you can produce and replicate ST effects—on meaningful dependent measures like intellectual performance, inter-group trust, social comfort in intergroup interaction, physiological and brain activity measures—by creating situations in which this pressure is strong. You can do this with the ease of a classroom demonstration. Thus, the question of whether ST effects “exist” per se, or whether ST effects are “replicable” per se, is hardly contestable. For the “existence” and “replicability” questions to be more meaningful, you have to ask for whom and where ST is a meaningful pressure: for women in advanced college STEM classes? for Muslims boarding an airplane? for white players competing against a predominantly black basketball team? Addressing these questions can be informative, implicitly testing the “existence” and “replicability” questions while telling us how big a role ST plays in situations of particular concern.

  3. The Size of ST Effects

The question “How big are ST effects?” has occupied many recent discussions. Here theory is everything; experimental effects of ST should be strong in situations where this predicament (specified by ST theory) is strong, and weak in situations where this predicament is weak. ST theory predicts that ST effects will be strong when: a) the person is identified with a domain of performance or functioning in which he or she can be negatively stereotyped (based on stereotypes about one or more of his or her social identities); b) the performance or functioning is frustrating enough to make the negative stereotype possibly applicable to the person as an account of that frustration; and c) features of the situation and culture suggest some possibility of being stereotyped or, at least, do not quell that possibility. When these specifications are met in an experimental situation, the ST predicament is strong and its effects on behavior (e.g., performance) and physiological functioning should be significant. When these specifications are not met (one or more of these features is absent or weak), the ST predicament is weak, and its effects should be weak or absent.

A meta-analysis covering over 350 studies is on the horizon; it will assess this theoretical prediction across the ST literature more precisely than has been done to date.

But for now, in assessing whether or not a study “tests” a given ST effect, it is essential to know whether the study included an experimental condition that meets the theory-specified conditions for producing ST, and a control condition that doesn’t. For example, a study we’ve seen cited as a “failure to replicate” ST is Stricker & Ward (2004). This study contrasted conditions in which test-takers completed demographic information (race in one study, gender in another) either before or after taking high-stakes standardized tests in real-life settings. Providing this information before the test was presumed to be the ST condition (priming the relevance of the identity stereotype); providing it after the test was presumed to be the no-ST condition (since that way, test-takers might take the test without the identity stereotype being primed). They found a modest effect. But here’s the problem: ST theory would predict little or no ST effect here because both conditions—the reputed ST condition and the reputed no-ST condition—have the features the theory says would produce the effect: test-takers invested in doing well on the test; a high-stakes test frustrating enough to itself evoke the stereotype as a possible account of their experience; and an American context in which minority and women test-takers would know that their weak performance could be seen as stereotype-confirming. Put simply, both the ST and the no-ST condition would be expected to cause ST, so that finding no difference between these conditions, or only a weak one, is not a “failure to replicate” an ST effect; it is, essentially, a failure to test ST. The theory of the ST predicament has to be kept in mind when assessing the replicability of ST effects. These effects should replicate under conditions predicted to produce the effect, but otherwise not.

  4. The information value of a single replication failure

When an effect found in study A doesn’t replicate in study B, it raises the possibility that the original effect happened by chance, or was due to poor research practices. When there are few such demonstrations, a failed replication is generally more indicting than when there are many prior demonstrations of the effect. The logic here is Bayesian: new information (a replication failure) adjusts our prior belief about an effect more when that prior belief is based on little information than when it is based on lots of information. Moreover, in a large literature, one must expect some replication failures on the basis of chance alone. Still, following a Bayesian process, each failure to replicate earlier studies should adjust our expectations about the strength of the effect going forward.
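To make the Bayesian logic concrete, here is a toy numerical sketch. The specific probabilities are our illustrative assumptions, not estimates from the ST literature:

```python
def posterior_real(prior_real, p_fail_if_real=0.2, p_fail_if_null=0.95):
    """P(effect is real | one replication failure), by Bayes' rule.

    Likelihoods are illustrative assumptions: a real effect still fails
    to replicate 20% of the time (low power, changed context), while a
    null effect fails to replicate 95% of the time.
    """
    numerator = p_fail_if_real * prior_real
    denominator = numerator + p_fail_if_null * (1 - prior_real)
    return numerator / denominator

# Few prior demonstrations -> weak prior belief that the effect is real
weak = posterior_real(0.50)    # ~0.17: belief drops sharply
# Many prior demonstrations -> strong prior belief
strong = posterior_real(0.95)  # ~0.80: belief drops only modestly
```

Either way belief moves downward, but the same single failure moves a weakly supported belief much further than a strongly supported one.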

For example, there are so many demonstrations of dissonance-reducing attitude change, over such a vast array of circumstances and procedures, that the failure to replicate a particular dissonance experiment in a particular circumstance wouldn’t strongly imply that dissonance reduction is not a real phenomenon. What such an outcome would mean is hard to say. Perhaps it reflects a boundary condition of the original effect? Perhaps it happened by chance? Perhaps it reflects that the methodologies used in the original research have essentially timed out; that is, no longer having the same meanings in this new era, they no longer produce the effect? Perhaps it reflects an experimenter’s inattention to context or critical detail? Against a large literature of successful replications and conceptual replications, particular replication failures have enough alternative explanations to make them difficult to interpret—perhaps too many to convincingly indict the earlier effects, even as, following Bayesian logic, they should adjust one’s expectations downward. A caution light, not a stop light. Yet when there are few prior replications or conceptual replications, a failed replication is more suggestive.

One could say that even large literatures can be so corrupted by questionable research practices like p-hacking that the size of a literature is no protection against its being uninterpretable. Clearly these practices can produce false results, and anything is possible. But such accounts often assume that particular investigator motives, namely those aroused by intense pressures to publish, dominate research. Those pressures are undeniable. But they aren’t the only investigator motives. There are also genuine scientific motives, such as expanding human understanding and developing useful, real-world interventions for important problems, all of which push toward careful research practices that produce truly interpretable results. Seeing a large literature such as that of cognitive dissonance (and, if we dare add, ST) as corrupted beyond interpretability simply because the motive for corruption exists (pressure to publish) and the means exist (e.g., p-hacking) is to disregard the impact of these other investigator motives and to embrace, in our view, an unlikely possibility.

  5. Exact vs. Conceptual Replications as Tests of an Effect’s Replicability

In social psychology, the meanings of our research materials, protocols, and operations are tied to the culture, context, and era in which the research was done. The same materials and procedures that evoke a process at one time, in one setting, may not evoke it at another time in a different setting: their meaning may have “timed out,” a new context may change their meaning, and so on. Evoking the process in a second situation may require different materials and operations, ones that have been shown, largely through pretesting, to create the conditions theorized to produce the effect in that second situation. That is, one may have to “conceptually replicate” the first experiment. If, proceeding this way, one finds the effect, then the phenomenon or process under study has been shown to generalize to a different situation and, in that sense, to be replicable.

Doing an “exact replication” in that second situation, without first showing that the materials and procedures produce the conditions theorized to evoke the process under study, is ill-advised. If it fails to replicate the original effect, one doesn’t know whether the failure is due to the phenomenon being unreal, or due to materials and procedures that, in that second situation, simply didn’t evoke the processes critical to getting the effect.

For example, the same experimental materials and procedures that evoke cognitive dissonance and self-justifying dissonance reduction in New York may not do so in Tokyo. To test whether or not cognitive dissonance can be aroused in the Japanese of Tokyo, one will have to use materials and procedures shown, among the Japanese of Tokyo, to produce the conditions (an important self-contradiction) theorized to produce cognitive dissonance. This will likely involve different materials and operations than those used in New York. One will have to conceptually replicate the experiment done in New York. Proceeding this way, if one finds dissonance reduction among the Japanese of Tokyo, then the original New York study has been conceptually replicated: its finding of dissonance reduction in response to an important self-contradiction shown to be replicable and to generalize to the Japanese of Tokyo. If this conceptual replication fails, it suggests that culture and the way it organizes self-functioning may be different in Tokyo than in New York. Perhaps the process of dissonance reduction itself is somehow culturally moderated. Interesting questions arise. But the point is that the conceptual replication leads to an interpretable result, not the exact replication of the New York study.

  6. ST transitions out of the laboratory into the field

Especially relevant to questions about the strength and reliability of ST effects is the fact that ST research has transitioned out of the laboratory into the real world, where the question is: how big is this effect in important areas of real life? The intervention research addresses this question by asking how big an improvement in the real-life academic performance of ability-stereotyped groups one can get by doing something that reduces the ST pressure they are under. And the answer very often is “a lot, for a long time, with very feasible efforts.” Over 37 such interventions (many with RCT designs and large sample sizes) have now been done, often replicated by independent investigators. Cumulatively, this work has opened up a new framework for addressing some of the nation’s most tenacious educational challenges.

More research needs to be done in such field settings, and larger, more comprehensive samples should be part of this effort. That said, the studies that have been done strongly encourage the view that even modest efforts to reduce ST can have meaningful effects on people’s real-world outcomes in school and society more generally. And this, after all, is where the tire of social psychological research and theorizing hits the road of affecting real-life outcomes. This is not to suggest that ST reduction is the sole answer to this class of challenges. But a truly impressive body of research on this approach now suggests that it can be an important part of the solution.

  7. A corrective era

The current focus on the replicability of research findings in social psychology constitutes an important corrective era in the field. Good developments have followed: a focus on eliminating questionable research practices, an effort to use larger sample sizes, systems for pre-registering experiments, a concern about multiple dependent measures, greater stress on sharing data, and so on. We’ve had corrective eras before. One of us remembers, for example, the depth of concern about experimenter bias in the late 1960s, about small-n demonstrational field studies in the 1970s, or even the struggle to establish experimentation as the methodology of preference, again in the 1960s—the origins of SESP, the Society of Experimental Social Psychology. A science maturing. We endorse the current era. Our purpose has been to discuss its significance for interpreting ST research, and to offer a broader scientific context for these concerns, one that is assuring as well as corrective.