Introduction

We articulated several concerns with the Gabbiadini et al. (2016) analysis in our original article and provided a commentary on the importance of preregistration for addressing many of our concerns about the literature concerning media effects on youth outcomes. We thank the original authors for taking the time to respond to our concerns. However, we were unpersuaded by their commentary and we believe it inadvertently underscores a major limitation of their procedures and dataset. In this reply, we identify areas of continued concern and provide some ideas for a more productive future.

Random Assignment and the Importance of Accurate Descriptions of Procedures

There is a strong relation between age and condition in the Gabbiadini et al. (2016) dataset. Nothing in Gabbiadini et al.’s comment adequately explains why this association exists. Moreover, we are unconvinced that using age as a covariate can remedy the apparent failure of randomization to create experimental groups that are roughly equivalent on background characteristics. More importantly, we are dismayed by the authors’ multiple, shifting, and increasingly opaque accounts of how youth were assigned to video game conditions.

  1.

    13 April 2016. In the initial article, the authors stated “Participants (N = 154) were randomly assigned to play a violent/sexist game, a violent-only game, or a non-violent game” (p. 1, repeated on p. 4). No mention was made of any assignment at the level of classroom.

  2.

    20 April 2016. In the comments section for their article, the authors replied to a concern from another scholar: “…the data collection lasted 15/17 days. Every day we used a different game. For example, the first day, participants planned to take part in the study in that very day played with GTA, the next day the participants planned for that day pleayed [sic] PINBALL, on the third day participants played half life [sic]… and so on.” (Interested readers are able to track these comments and responses on the PLOS One website: http://journals.plos.org/plosone/article/comments?id=10.1371/journal.pone.0152121).

  3.

    21 April 2016. In a follow-up response in the comments to further queries from the same independent scholar: “The games were randomized accross [sic] the experimental sessions. For example, if the class 2A consisted of 18 students, then the first 6 participants played GTA San Andreas/GTA vice city, the second group of the same class played PINBALL/QUBE while the third group played Half Life 1/Half life 2 [sic]. The order of the games was randomized during the days. We did not know the order in which the various classes were intended to participate in the study. There was a referent teacher who decided the order of the classes (i.e., first day ->2 A, second day 4 A….) Thus, at least 1/3 of class 2A played a violent and sexist game, 1/3 of the class played a neutral game and 1/3 payed [sic] a violent game.”

  4.

    5 April 2017. An additional comment by the Gabbiadini group: “Moreover, it is noteworthy that our research was conducted in a real context (i.e., high school) rather than in a lab context. thus, the practical limitations of this context did not allow us to assign participants to group conditions in a complete randomized way and the randomization process was dependent on participants’ classes. crucially, participants’ age, when entered as covariate, did not affect the pattern of our results.”

  5.

    In their comment on our reanalysis, the Gabbiadini group state: “…we had to randomly assign classrooms to conditions” and note that the internal committee of the high school gave them “only 1 week to collect all data”.

  6.

    22 August 2017. In response to an email from us about the procedures, Dr. Gabbiadini replied: “Classrooms were randomized across the three experimental conditions, and we tried our best also to randomize the type of video game across participants in each classroom, but this was not always possible…A total of 9 classrooms were involved in the data collection. We have not recorded an ID for each classroom because we did not know which classroom were entering in the lab. This aspect was managed by the high school organization for privacy protection of participants.”

These shifting descriptions represent an unacceptable degree of imprecision in reporting the procedures of a scientific study. More importantly, none of the authors’ statements adequately explains how age could be so strongly related to condition (e.g., 100% of 15-year-olds played Grand Theft Auto games). Such a pattern seems unlikely under explanation #3 unless students were strictly ordered by age within each classroom and that ordering perfectly matched the changing order of the video game assignments. The most recent explanation is far less definitive: the authors acknowledge that they tried their best to randomize the type of video game across participants but that this was not always possible. It is unclear what this implies about the exact procedures used to assign participants to conditions, but the statement is at odds with earlier claims of random assignment by participant, by classroom, or by serial position within a classroom.

In short, the straightforward description of the procedures in the original report is incorrect. Participants were not simply randomly assigned to condition, which means that the Gabbiadini et al. dataset has a complicated nested structure. However, because no classroom identifiers were recorded, there is no way to properly account for this nested structure in a multilevel model. The originally reported p-value for the focal interaction would likely change if a standard error appropriate to the design could be computed. Given that there is no way to model these data correctly, we believe it is reasonable to consider removing this study from the scientific record. As noted earlier, we do not believe that including age as a covariate is an appropriate solution to this issue.
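Why the nested design matters for the standard error can be illustrated with the Kish design effect, DEFF = 1 + (m - 1)ρ, where m is the average cluster size and ρ is the intraclass correlation (ICC) of the outcome within classrooms. The sketch below is a minimal illustration under stated assumptions, not a reanalysis. It assumes equal-sized classrooms and uses the N = 154 and 9 classrooms reported by the authors. The ICC values are hypothetical, because the unrecorded classroom IDs make ρ impossible to estimate from these data.

```python
import math


def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def se_inflation(avg_cluster_size: float, icc: float) -> float:
    """Factor by which a naive (independent-observations) standard error understates the clustered one."""
    return math.sqrt(design_effect(avg_cluster_size, icc))


def effective_n(n: int, avg_cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    return n / design_effect(avg_cluster_size, icc)


N = 154             # participants reported in Gabbiadini et al. (2016)
CLASSROOMS = 9      # classrooms, per Dr. Gabbiadini's 22 August 2017 email
m = N / CLASSROOMS  # about 17.1 students per classroom, assuming equal sizes

# Hypothetical ICCs: these are NOT estimated from the data, which lack classroom IDs.
for icc in (0.0, 0.05, 0.10, 0.20):
    print(f"ICC={icc:.2f}  SE inflation={se_inflation(m, icc):.2f}  "
          f"effective N={effective_n(N, m, icc):.1f}")
```

Under these assumptions, an ICC of 0.10 would mean the naive standard error understates the clustered one by a factor of about 1.6, giving an effective N of roughly 59 rather than 154. Because condition was assigned at least partly by classroom, the number of independent units for the condition comparison may be closer to the 9 classrooms than to the 154 students.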

Other Methodological Issues

Even if readers are willing to discount the ambiguities about procedures in the original study, there are other methodological concerns with the Gabbiadini et al. (2016) study. An overarching point of our article was to illustrate the large number of analytic possibilities in the original dataset and to highlight the potential to capitalize on researcher degrees of freedom, or to take a walk down the “Garden of Forking Paths,” to borrow a term from Andrew Gelman and Eric Loken. For example, in some cases using participants’ subjective ratings of game violence can turn non-significant results into statistically significant ones. Likewise, the dataset contained a number of variables and items that could be used to operationalize study constructs.

One issue that Gabbiadini and colleagues raised in their comment concerns the “avatar identification” variable. We originally pointed out that Gabbiadini et al. (2016) did not disclose that they had measured three separate avatar identification variables using items from a survey developed by Van Looy and colleagues (2012). This lack of disclosure is part of our concern about analytic flexibility in the original study. In their comment, Gabbiadini et al. also note that they did not include a fourth subscale, stating, “To keep our study within the 1-h time limit given to us by the high school, we dropped the Similarity Identification subscale, which is defined as ‘the degree to which the player sees their avatar as similar to him/herself.’ This subscale was considered less relevant because it is typically used for MMORPG (Massively Multiplayer Online Role Playing Games) virtual environments rather than stand-alone games like the ones we used.” They also provided some analyses to suggest why the subscale they reported (embodied presence), rather than the two they did not disclose in the original article (wishful identification, character empathy), was theoretically ideal.

For this reply, we reached out to the lead author of Van Looy et al. (2012) (Van Looy, personal communication 23 August 2017). Dr. Van Looy disagreed with the assessment Gabbiadini et al. offered in their comment on our article, noting that the excluded Similarity Identification subscale would in fact have been essential for fully understanding avatar identification. As Dr. Van Looy put it, “Stating that this subconstruct only relates to MMOs is incorrect.”

We still consider it problematic that Gabbiadini et al. (2016) measured three related constructs but reported results only for the statistically significant variable and not for the two others with null results. We are also concerned that the description in the published article could be read to indicate that the embodied presence variable was the only one collected. Moreover, Gabbiadini et al. appear to have excluded a fourth identification subscale that would likely have been critical for assessing their construct of interest. Arguing over which of the three measured variables should have been included in the PROCESS models may therefore be moot. The existing PROCESS models yield a mixture of significant and non-significant interactions, none of which involve the variables best suited for measuring avatar identification.
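Measuring several related outcomes and reporting only the significant one carries a cost that simple arithmetic can make concrete. Suppose k outcomes are each tested at α = .05 and every null hypothesis is true. If the tests are independent, the chance that at least one reaches significance is 1 - (1 - α)^k. The sketch below computes this and checks it by simulation. Independence is an assumption made for illustration only: positively correlated subscales such as the avatar identification measures would produce a smaller rate, though one still above the nominal α.

```python
import random


def prob_any_significant(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_any_significant(k: int, alpha: float = 0.05,
                             reps: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo check: under a true null, each p-value is uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw independent null p-values; count the replicate if any falls below alpha.
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / reps


for k in (1, 3, 5, 10):
    print(f"k={k:2d}  analytic={prob_any_significant(k):.3f}  "
          f"simulated={simulate_any_significant(k):.3f}")
```

With k = 3, the number of identification subscales measured, the nominal 5% false-positive rate rises to about 14% under the independence assumption.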

Gabbiadini et al. present a new PROCESS model (Model 4), which they claim demonstrates a simple mediation effect. However, when we reanalyzed this model, we found that it shows the same inconsistency and unreliability we described in our original article about the Gabbiadini et al. (2016) analysis. Specifically, when age, gender, and game frequency are entered as covariates, the focal paths become non-significant, and they become significant again only when the violence rating covariate is added. Model 4 therefore provides no more conclusive evidence for mediation than the other models, because a few simple covariates (age, gender, frequency) are enough to render it non-significant.

On “Sexist” Games

In their comment, Gabbiadini and colleagues take issue with our concerns about using “sexist” to define certain classes of games. First, we should note that we never claimed that the Grand Theft Auto (GTA) series is devoid of content that many would find sexist; indeed, our article noted the potential for exposure to sexist content in GTA games. One might read their comment as implying that we staked out a strong position on the nature of the GTA series. Our concerns were more about the sandbox nature of these games in general and the use of the term sexist in the media effects literature.

In their comment, Gabbiadini and colleagues point to other studies that they present as further evidence for “sexist” game effects. It is worth noting that they seem uninterested in citing studies with null (e.g., Breuer et al. 2015) or ambiguous (Stermer and Burkley 2015) results. Moreover, the studies they provide do not appear to make as strong a case for the causal effects of sexist games as their comment portrays.

For instance, Gabbiadini and colleagues point to two studies from Ohio State (Fox et al. 2013, 2014) suggesting that the use of sexualized avatars in a video game results in higher rape myth acceptance among women. However, neither study actually involved a video game; both instead used avatars in social situations. Further, in the first study (Fox et al. 2013), rape myth acceptance was actually lowest among women using a sexualized avatar without their own face, even compared to the non-sexualized control groups. One could therefore argue from this study that it is better to use sexualized avatars so long as they do not have one’s own face. Moreover, both studies have potential flaws such as rather blatant demand characteristics, a problem unfortunately common in media/body image research (Ferguson 2013a; Want 2014; Whyte et al. 2016).

Evidence from another study cited by Gabbiadini and colleagues (Dill et al. 2008) likewise proves difficult to interpret on closer scrutiny. Once again, participants did not actually play video games; rather, they were exposed to PowerPoint slides. Individuals did not appear to have been randomized to condition; instead, classes of individuals were randomized. Introductory psychology courses were used, and demand characteristics appear to be present in the experiment. Moreover, the experimental results were inconsistent, with small effects found for one outcome (judgements of sexual harassment) but not the other (rape-supportive attitudes). The finding for sexual harassment also held only for males, and females exposed to sexualized images were actually lowest in their tolerance toward sexual harassment (a finding diametrically opposed to those of the Fox et al. 2014 studies). A survey measure of violent video game exposure was correlated with both outcomes, but once this variable was entered into the main factorial ANOVA analyses, the influence of prior violent video game use became non-significant. This inconsistent set of results from a fairly weak design provides less than compelling evidence for sexist game effects.

The final article cited by Gabbiadini and colleagues (Helfgott 2015) is a review article. Many of its central theses regarding video games or other media causing copycat crimes have since been discredited (Surette 2013; Surette and Maze 2015).

Thus, the articles cited by Gabbiadini and colleagues do not provide an especially strong framework to argue for game effects on sexism as none of them involve commercial video games as stimuli at all. The results from these studies are not always consistent and there are potential concerns with the designs. Over-interpreting results and a failure to acknowledge null effects from other studies has been a consistent problem for media effect research, where rhetoric often outstrips the available data (Markey et al. 2015).

Similarly, Gabbiadini and colleagues claim that “In fact, a scientific consensus is beginning to emerge around the potentially harmful effects of sexist violent video games on players”. At best, such a claim is an appeal-to-consensus fallacy that one could interpret as pressure for scholars to conform to the right way of thinking on a moral issue. Until recently, it was common to hear advocates of causal effects for violent video games claim consensus, although subsequent surveys of scholars disputed this notion (e.g., Bushman et al. 2015a; Ferguson and Colwell 2017; Quandt et al. 2015; but also see Ivory et al. 2015 and Etchells and Chambers 2014 for critical comments on the Bushman et al. 2015a article). Such claims of consensus tend to reflect moral advocacy agendas rather than the products of good science.

On Strawpeople and Sensationalism

Gabbiadini and colleagues believe we created a strawperson argument about their work in our original article. They seemed aggrieved by any suggestion that their null result for the direct effect of video games on reduced empathy toward women was meaningful. We did not spend a great deal of time on this issue in our original article, but it is worth considering whether we actually constructed a strawperson argument. To be frank, we suspect that Gabbiadini and colleagues could have predicted direct effects of GTA games on empathy based solely on the literature they cite in their response to our comment. For example, under the heading “Is playing with sexist video games just harmless fun?”, they cite previous studies that seem to report direct effects of media exposure on variables such as the acceptance of rape and attitudes supportive of violence toward women. These variables do not seem terribly different from the empathy variable in their dataset. Given the alleged (but perhaps misrepresented) results of prior studies that they seem to endorse, why wouldn’t researchers expect the sexist game condition to affect the empathy variable they collected?

Indeed, we think it is notable that the straightforward ANOVA results do not support a main effect of condition on reduced empathy, in light of the literature Gabbiadini and colleagues cited in their reply. To their credit, this null result is reproducible and stated in the original report. However, this very question of prior predictions about direct vs. indirect effects underscores the value of preregistration, a major theme of our original article. Again, to be frank, we have no idea what a priori model guided Gabbiadini et al. (2016) because there is no record of their planned analyses and measurement strategy.

Our broad point was that a preregistered analytic plan renders p-values meaningful, constrains researcher degrees of freedom, and eliminates concerns about hypothesizing after the results are known (i.e., HARKing; Kerr 1998). Nothing in the reply by Gabbiadini and colleagues changes this reality about the virtues of preregistration. They spend some time misstating when we began preregistering our studies (Ferguson has been preregistering studies since 2014, not 2017, e.g., Ferguson et al. 2015; and Donnellan was a co-author on two preregistered articles published in 2014, i.e., Johnson et al. 2014; Lynott et al. 2014), but these inaccuracies do not undermine the fundamental benefits of preregistration.

Further, Gabbiadini and colleagues seem to disavow ever claiming direct effects between video games and sexist attitudes. They state “To our knowledge, no media-violence researcher has ever made such a claim.” However, language implying direct and powerful effects is, in fact, quite common in this literature. We present several examples in Table 1.

Table 1 Sensationalist claims by scholars linking “Sexist” games to sexist attitudes or behaviors in real life

Admittedly, in some cases researchers may briefly mention other variables. For instance, in a press release headlined “Sexist video games decrease empathy for female violence victims”, coauthor Dr. Brad J. Bushman noted, “Most people would look at these images and say the girl pictured has to be terrified. But males who really identified with their characters in the sexist, violent games didn’t feel as much empathy for the victim.” Later in the press release Dr. Bushman added, “If you see a movie with a sexist character, there’s a certain distance. But in a video game, you are physically linked to the character. You control what he does. That can have a real effect on your thoughts, feelings, and behaviors, at least in the short term…You may think the games are just harmless fun. But when boys play them and identify with the male characters in the game, it can lead to agreement with some pretty disturbing beliefs about masculinity and how to treat women.”

In the above example, one of the mediating variables (character identification) is mentioned. Nevertheless, it is unclear that this brief mention does anything to obviate the overall implication of direct, public-health-level “harm” effects. This may be an instance of what has sometimes been called the “Yes I Said It, No I Didn’t” phenomenon, in which scholars assert clear, alarming effects but also include vague qualifiers that can serve as cover if critics call them out for alarmism (see, for example, the exchange between Markey et al. 2015; then Bushman et al. 2015b; finally Markey et al. 2015). Again, to be frank, in such a context it is easy to accuse critics of creating strawperson arguments. We also echo the concern of Markey et al. (2015) that scholars often imply direct, causal, public-health-level effects of video games that the data from their studies cannot support.

A Way Forward

In general, we are concerned that the study of “sexist” video games may be following the path of research on video games and violence. This includes using emotionally evocative labels for games, inflated rhetoric regarding the strength and consistency of effects, exaggerated press releases, and weak research designs (Ferguson 2013b; Hall et al. 2011; Markey et al. 2015). The violent video game field also appears to have been touched by psychology’s replication crisis, with direct replications of older studies failing to confirm previous results (e.g., Przybylski et al. 2014; Tear and Nielsen 2013, 2014) and signs of publication bias in the existing literature (Hilgard et al. 2017). We worry that a similar state of affairs is emerging for “sexist” video game research, and it may be a facet of the larger moral panic surrounding video games and games research (Bowman 2016; Ferguson 2013b). We also understand that academic debates occasionally make for good theater (and even fun), and we consider the response by Gabbiadini and colleagues in that light. Nonetheless, it is important to move beyond theatrics, so we conclude with some constructive suggestions for how the field and consumers of research on media effects on youth outcomes may move forward.

Preregistration

As we have indicated both in this reply and in our original article, preregistration has numerous benefits. It provides confidence that complex moderator/mediator analyses were planned in advance and are not the result of HARKing or Garden of Forking Paths issues. The current back and forth between our respective groups would look very different if there were verifiable evidence of the theoretical model that initially guided the original study, together with a detailed analytic plan.

We are also concerned that theories in media effects are often slippery and shifting when it comes to direct effects vs. moderated mediation, and to which variables count as mediators as opposed to ultimate dependent variables. We worry that existing studies afford a great deal of analytic flexibility (e.g., Elson et al. 2014), and we suspect it is fairly easy to find at least one effect with p < .05 (e.g., Simmons et al. 2011). Collectively, these conditions make many claims about media effects difficult to falsify. Preregistration, along with a commitment to publishing null effects, would help produce a literature with unbiased effect size estimates. Such a literature may actually prove useful to parents and others concerned about youth development. Our fear is that the current literature is biased in ways that make it poorly suited for drawing real-world implications.

21-Word Solution

Simmons et al. (2012) suggest a simple disclosure statement that can be used to make sure all variables are reported in a published article. They call this a 21-word solution to some of the problems of analytic flexibility. The statement reads: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” and can be included (when true) in the Method section of any article. Including this statement in published research would further help address concerns about analytic flexibility, such as the reporting omissions we identified in Gabbiadini et al. (2016). Consumers of research can treat the statement as an additional signal of the quality of the work. The 21-word solution could not have been truthfully applied to the Gabbiadini et al. (2016) report as it was written.

Separating Advocacy from Science

We support advocacy pushing for better representations of female characters in video games, and we salute some recent positive moves in this direction (e.g., the Tomb Raider reboot, Horizon Zero Dawn, Alice: Madness Returns, Portal, Gone Home, and Beyond Good and Evil). We also believe that sexist attitudes and practices are deplorable. However, advocacy and science are distinct, with different objectives and different evidentiary requirements. Advocacy is about changing practices and attitudes, whereas science is ultimately about figuring out reality. Advocates often emphasize information that supports a particular goal while deemphasizing or even omitting information that does not support their position. Advocacy can be fueled by explicitly moral agendas. Science searches for truth, however convenient or inconvenient that truth may be for any particular agenda or perspective. In many cases, combining advocacy with science may prove detrimental to both efforts.

Advocacy is important for drawing attention to sexist representations in games and motivating designers to change the depictions of women in games. Likewise, efforts to highlight disparities in gender representation among game designers, or the harassment faced by female gamers, are worthwhile. To the extent that advocates rest their arguments on the existence of causal media effects, however, they risk making claims on shaky ground. Concerns that the evidence cited in these arguments is “cherry-picked” or discredited by other research could inadvertently harm well-intentioned advocacy efforts by undermining their credibility.

We argue that science is most effective when it remains neutral with respect to advocacy efforts. We understand that many scholars may wish to put their data to use in support of various efforts to better the human condition. However, we struggle to think of many examples in which mixing advocacy with science did not damage the objectivity of the latter. This has been a documented problem in video game violence research, where some scholars have been closely associated with, or received research funding from, anti-media advocacy groups (Ferguson 2013b). These mistakes should not be repeated in research on sexist media.

Conclusion

The main point of our original article was to draw attention to an issue with random assignment in the Gabbiadini et al. (2016) article and to point to potential examples of analytic flexibility in that article. We concluded that the evidence supporting the authors’ main thesis was weaker than presented in the article. We believe that our arguments for preregistration and for demanding strong evidence are important for anyone interested in scientific topics related to youth development. Nothing in the response by Gabbiadini and colleagues undermines our main points. If anything, the additional details about the original procedures further weaken the evidence in the original report. It may be impossible to calculate appropriate standard errors given the dependencies produced by their method of assigning participants to conditions.