Abstract
Studies in socio-technical aspects of security often rely on user studies and statistical inferences on investigated relations to make their case. They thereby enable practitioners and scientists alike to judge the validity and reliability of the research undertaken.
To ascertain this capacity, we investigated the reporting fidelity of security user studies.
Based on a systematic literature review of 114 user studies in cyber security from selected venues in the 10 years 2006–2016, we evaluated the fidelity of the reporting of 1775 statistical inferences using the R package statcheck. We conducted a systematic classification of incomplete reporting, reporting inconsistencies, and decision errors, leading to a multinomial logistic regression (MLR) on the impact of publication venue/year as well as a comparison to a compatible field of psychology.
We found that half the cyber security user studies considered reported incomplete results, in stark contrast to comparable results in a field of psychology. Our MLR on analysis outcomes yielded a slight increase in the likelihood of incomplete tests over time, while SOUPS showed a few percent greater likelihood of reporting statistics correctly than other venues.
In this study, we offer the first fully quantitative analysis of the state-of-play of socio-technical studies in security. While we highlight the impact and prevalence of incomplete reporting, we also offer fine-grained diagnostics and recommendations on how to respond to the situation.
Preregistered at the Open Science Framework: osf.io/549qn/.
References
American Psychological Association (ed.): Publication Manual of the American Psychological Association, 6th revised edn. American Psychological Association (2009)
Coopamootoo, K.P.L., Groß, T.: Cyber security and privacy experiments: a design and reporting toolkit. In: Hansen, M., Kosta, E., Nai-Fovino, I., Fischer-Hübner, S. (eds.) Privacy and Identity 2017. IAICT, vol. 526, pp. 243–262. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92925-5_17
Coopamootoo, K., Groß, T.: Systematic evaluation for evidence-based methods in cyber security. Technical report TR-1528, Newcastle University (2017)
Coopamootoo, K.P.L., Groß, T.: Evidence-based methods for privacy and identity management. In: Lehmann, A., Whitehouse, D., Fischer-Hübner, S., Fritsch, L., Raab, C. (eds.) Privacy and Identity 2016. IAICT, vol. 498, pp. 105–121. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-55783-0_9
Cumming, G.: Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York (2013)
Elson, M., Przybylski, A.K.: The science of technology and human behavior - standards old and new. J. Media Psychol. 29(1), 1–7 (2017). https://doi.org/10.1027/1864-1105/a000212
Epskamp, S., Nuijten, M.B.: statcheck: extract statistics from articles and recompute p values (v1.3.0), May 2018. https://CRAN.R-project.org/package=statcheck
Fox, J., Andersen, R.: Effect displays for multinomial and proportional-odds logit models. Sociol. Methodol. 36(1), 225–255 (2006)
Lakens, D.: Checking your stats, and some errors we make, October 2015. http://daniellakens.blogspot.com/2015/10/checking-your-stats-and-some-errors-we.html
LeBel, E.P., McCarthy, R.J., Earp, B.D., Elson, M., Vanpaemel, W.: A unified framework to quantify the credibility of scientific findings. Adv. Methods Pract. Psychol. Sci. 1(3), 389–402 (2018)
Maxion, R.: Making experiments dependable. In: Jones, C.B., Lloyd, J.L. (eds.) Dependable and Historic Computing. LNCS, vol. 6875, pp. 344–357. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24541-1_26
Moher, D., et al.: CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. J. Clin. Epidemiol. 63(8), e1–e37 (2010)
Nuijten, M.B., van Assen, M.A., Hartgerink, C.H., Epskamp, S., Wicherts, J.: The validity of the tool “statcheck” in discovering statistical reporting inconsistencies (2017). https://psyarxiv.com/tcxaj/
Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M., Epskamp, S., Wicherts, J.M.: The prevalence of statistical reporting errors in psychology (1985–2013). Behav. Res. Methods 48(4), 1205–1226 (2015). https://doi.org/10.3758/s13428-015-0664-2
Peisert, S., Bishop, M.: How to design computer security experiments. In: Futcher, L., Dodge, R. (eds.) WISE 2007. IAICT, vol. 237, pp. 141–148. Springer, New York (2007). https://doi.org/10.1007/978-0-387-73269-5_19
Ripley, B., Venables, W.: nnet: feed-forward neural networks and multinomial log-linear models, February 2016. https://CRAN.R-project.org/package=nnet
Schechter, S.: Common pitfalls in writing about security and privacy human subjects experiments, and how to avoid them (2013). https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/commonpitfalls.pdf
Schmidt, T.: Sources of false positives and false negatives in the STATCHECK algorithm: reply to Nuijten et al. (2016). https://arxiv.org/abs/1610.01010
Acknowledgment
We would like to thank Malte Elson for the discussions on statcheck, on the corresponding analyses in psychology, and on general research methodology. We thank the anonymous reviewers of STAST 2019 for their discussion and insightful comments, as well as the volume co-editor Theo Tryfonas for offering additional pages to include the requested changes.
This study was funded in part by the UK Research Institute in the Science of Cyber Security (RISCS) under a National Cyber Security Centre (NCSC) grant on “Pathways to Enhancing Evidence-Based Research Methods for Cyber Security” (Pathway I led by Thomas Groß). The author was funded in part by the ERC Starting Grant CASCAde (GA no. 716980).
Appendices
A Details on Qualitative Analysis
A.1 Errors Committed by statcheck
Parsing Accuracy. In all 34 error cases, statcheck parsed the PDF file correctly, and its raw test representation corresponded to the PDF. In all but two tests, statcheck recognized the test correctly. In said two cases, it mistook a non-standard-reported Shapiro-Wilk test as a \(\chi ^2\) test, creating two false positives. There was one case in which the p-value statcheck computed for an independent-samples t-test differed marginally from our own calculation, presumably because of an unreported Welch correction.
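statcheck's core consistency check recomputes the p-value from the reported test statistic and degrees of freedom, and compares it with the reported p at the reported precision. The sketch below illustrates the idea for a z statistic, for which the two-tailed p has a closed form via the complementary error function; the function name and tolerance handling are our own simplification, not statcheck's actual implementation (which covers t, F, \(\chi^2\), r, and z tests in R).

```python
from math import erfc, sqrt

def z_p_consistent(z, reported_p, digits=3):
    """Recompute the two-tailed p-value for a z statistic and check
    whether it matches the reported value at the reported precision."""
    p = erfc(abs(z) / sqrt(2))  # two-tailed p under the standard normal
    return p, round(p, digits) == round(reported_p, digits)

# A correctly reported test: z = 1.96, p = .05
p, ok = z_p_consistent(1.96, 0.05)   # ok is True
# An inconsistent report: the recomputed p is ~.0099, not .05
_, bad = z_p_consistent(2.58, 0.05)  # bad is False
```

A real checker would additionally distinguish mere inconsistencies from decision errors, i.e. cases where the recomputed p crosses the significance threshold the reported p claims.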
One-Tailed Tests. In seven cases, statcheck recognized one-tailed tests correctly. For three of those tests, the authors framed the hypotheses as one-tailed. In three other tests, the authors used one-tailed test results without declaring their use. There was one additional case in which the authors seemed to have used a one-tailed test, yet the rounding was so far off the one-tailed result that statcheck no longer accepted it as “valid if one-tailed”. There was one test marked as “one-tail” which statcheck did not recognize as one-tailed, yet that test also suffered from rounding errors.
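A reported p counts as “valid if one-tailed” when it matches half the recomputed two-tailed value. A minimal sketch of that check, using a z statistic as a simplifying assumption (the helper name is ours, not statcheck's):

```python
from math import erfc, sqrt

def valid_if_one_tailed(z, reported_p, digits=3):
    """A reported p is consistent with a one-tailed reading if it
    equals half the two-tailed p recomputed from the statistic."""
    p_two = erfc(abs(z) / sqrt(2))
    return round(p_two / 2, digits) == round(reported_p, digits)

# z = 1.645 gives a two-tailed p of ~.10, i.e. a one-tailed p of ~.05
assert valid_if_one_tailed(1.645, 0.05)
```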
Dependent-Samples Tests. There were 7 papers using dependent-samples methods (such as matched-pair tests or mixed-methods regressions). We found that statcheck treated the corresponding dependent-samples statistics correctly.
Multiple Comparison Corrections. In three cases, statcheck did not recognize p-values that were correctly Bonferroni-corrected, counting as three false positives. It remains open, however, how many papers should have employed multiple-comparison corrections but did not, an analysis statcheck does not perform.
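Such false positives could be suppressed by additionally testing the reported p against a Bonferroni-adjusted recomputation, that is, the raw p multiplied by the number of comparisons m (capped at 1). A sketch under the same simplifying z-statistic assumption; note that m would have to be inferred from the paper, which statcheck does not attempt:

```python
from math import erfc, sqrt

def bonferroni_consistent(z, m, reported_p, digits=3):
    """Check a reported p against the Bonferroni-adjusted p-value:
    the raw two-tailed p times the number of comparisons m."""
    p_raw = erfc(abs(z) / sqrt(2))
    p_adj = min(1.0, m * p_raw)
    return round(p_adj, digits) == round(reported_p, digits)

# z = 2.58 has raw p ~.0099; corrected over m = 5 comparisons, p ~.049
assert bonferroni_consistent(2.58, 5, 0.049)
```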
A.2 Errors Committed by Authors
Typos. We considered 6 errors to be typos or transcription errors (\(18\%\)). One further error seemed to be a copy-paste error (\(3\%\)).
Rounding Errors. Of all 34 reported errors, we found 8 to be rounding errors (\(24\%\)).
Miscalculations. We found 13 cases to be erroneous calculations (\(38\%\)).
A.3 Composition of Incomplete p-Values
Of 1523 incomplete cases, 134 were declared “non-significant” without giving the actual p-value (\(8.8\%\)). Further, 6 were reported only as \(p > .05\) (\(0.394\%\)).
Of the incomplete cases, 102 were reported statistically significant at a .05 significance level (\(6.7\%\)).
Of the incomplete cases, 477 were reported statistically significant at a lower significance level of .01, .001, or .0001 (\(31.3\%\)).
Of 1523 incomplete p-values, 680 gave an exact p-value (\(44.6\%\)). Of those exactly reported p-values, half (367) were claimed statistically significant at a significance level of \(\alpha = .05\) (\(54\%\)). Of those exactly reported p-values, 19 claimed an impossible p-value of \(p = 0\) (\(2.79\%\)).
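The proportions above follow directly from the reported counts; a quick arithmetic check, with the counts taken from this appendix:

```python
# Counts among the 1523 incomplete p-values, as reported above
counts_of_1523 = {"declared non-significant": 134, "p > .05": 6,
                  "significant at .05": 102, "significant below .01": 477,
                  "exact p given": 680}
# Shares of the 1523 incomplete p-values, in percent
shares = {k: round(100 * v / 1523, 1) for k, v in counts_of_1523.items()}
# Of the 680 exact p-values: 367 claimed significant, 19 reported p = 0
exact_shares = (round(100 * 367 / 680), round(100 * 19 / 680, 2))
```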
Online Supplementary Materials
We made the materials of the study (specification of the input SLR, included sample, contingency tables) publicly available at its Open Science Framework repository (see Footnote 1).
© 2021 Springer Nature Switzerland AG
Groß, T. (2021). Fidelity of Statistical Reporting in 10 Years of Cyber Security User Studies. In: Groß, T., Tryfonas, T. (eds.) Socio-Technical Aspects in Security and Trust. STAST 2019. LNCS, vol. 11739. Springer, Cham. https://doi.org/10.1007/978-3-030-55958-8_1
Print ISBN: 978-3-030-55957-1
Online ISBN: 978-3-030-55958-8