
Journal of Informetrics

Volume 8, Issue 4, October 2014, Pages 963-971

Regression for citation data: An evaluation of different methods

https://doi.org/10.1016/j.joi.2014.09.011

Highlights

  • Ordinary least squares regression is recommended for citation data after adding 1 and taking the log.

  • The generalised linear model with lognormal residuals is recommended for citation data.

  • Inappropriate regression models can substantially inflate the chance of detecting false factors within citation data.

  • Regression models are evaluated for citation data and clear recommendations made for the best ones.

Abstract

Citations are increasingly used for research evaluations. It is therefore important to identify factors affecting citation scores that are unrelated to scholarly quality or usefulness so that these can be taken into account. Regression is the most powerful statistical technique to identify these factors and hence it is important to identify the best regression strategy for citation data. Citation counts tend to follow a discrete lognormal distribution and, in the absence of alternatives, have been investigated with negative binomial regression. Using simulated discrete lognormal data (continuous lognormal data rounded to the nearest integer) this article shows that a better strategy is to add one to the citations, take their log and then use the general linear (ordinary least squares) model for regression (e.g., multiple linear regression, ANOVA), or to use the generalised linear model without the log. Reasonable results can also be obtained if all the zero citations are discarded, the log is taken of the remaining citation counts and then the general linear model is used, or if the generalised linear model is used with the continuous lognormal distribution. Similar approaches are recommended for altmetric data, if it proves to be lognormally distributed.

Introduction

The use of performance monitoring for university research has increased over the past few decades. This is most evident in national research evaluation exercises, such as those in the UK (Mryglod, Kenna, Holovatch, & Berche, 2013), Australia (ARC, 2014), New Zealand (Anderson, Smart, & Tressler, 2013) and Italy (Abramo, D’Angelo, & Di Costa, 2011). This climate not only affects the allocation of research funding in many cases but can also change the behaviour of individual researchers as they come to terms with the assessment system (Butler, 2003). Although the most important performance monitoring exercises often rely on peer review, both the UK (REF, 2013) and Australia (ARC, 2014) consider citations for some subject areas, and there are advocates of increasing use of citations for some types of science when the results correlate with peer review, because citation metrics are much cheaper than peer review (Abramo et al., 2013, Franceschet and Costantini, 2011, Mryglod et al., 2013), although no simple method is likely to work (Butler & Mcallister, 2011). In addition, citations are used for formal and informal evaluations of academics (Cole, 2000) and the Journal Impact Factor (JIF) is a widely recognised and used citation metric.

The range of metrics of relevance to science has recently increased with the emergence of webometrics (Almind & Ingwersen, 1997), which includes a range of new impact indicators derived from the web (Kousha & Thelwall, 2014) and altmetrics (Priem, Taraborelli, Groth, & Neylon, 2010), which incorporate many attention and impact indicators derived from social web sites (Priem, 2014). Altmetrics seem particularly promising to help researchers to identify recently published articles that have attracted a lot of attention (Adie & Roe, 2013) and to give evidence of non-standard impacts of research that can be added to CVs (ACUMEN, 2014, Piwowar and Priem, 2013). Statistical analyses of some of these have started to generate new insights into how science works (Mohammadi and Thelwall, 2014, Thelwall and Maflahi, in press) and the types of research impacts that are not recognised by traditional citation counts (Mohammadi & Thelwall, 2013).

Because of the many uses of citations within science, it is important to understand as much as possible about why they are created and why one article, researcher or research group may be more cited than another. Whilst citations may be given to acknowledge relevant previous work (Merton, 1973), they can also be used to criticise or argue against it (MacRoberts & MacRoberts, 1996) and so citations are not universally positive. Moreover, citations do not appear to be chosen in a dispassionate, unbiased way (Borgman & Furner, 2002). For example, researchers in some fields tend to cite papers written in the same language (Yitzhaki, 1998), highly relevant citations may be systematically ignored (McCain, 2012) and fame seems also to attract citations (Merton, 1968). There are also field differences in the typical number of citations received by papers (e.g., Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011) and review articles are more likely to be highly cited than other articles (e.g., Aksnes, 2003). From a different perspective, factors that associate with highly cited papers, such as collaboration, internationality and referencing patterns (Didegah and Thelwall, 2013b, Sooryamoorthy, 2009), are important because they may push researchers and funders towards more successful types of research.

Although some of the factors affecting citations discussed above have been discovered using qualitative methods, such as interviews with authors, statistical methods are needed to identify the magnitude of factors and to detect whether they apply in particular cases. The simplest approach is probably to compare the average number of citations (or citation-based indicators) for one set of papers against that of another to see which is higher (van Raan, 1998). Another strategy is to assess whether citation counts correlate significantly against another metric that is hypothesised to be related (Shema, Bar-Ilan, & Thelwall, 2014). The most discriminating methods used so far are regression-based because they allow the effects of multiple variables to be examined simultaneously. In particular, regression guards against one factor being identified as significant (e.g., international collaboration) when another factor (e.g., collaboration with the USA) is the underlying cause of higher (or lower) citations.

There is no consensus about the best regression method for citation data. Methods used so far include ordinary least squares linear regression (Aksnes et al., 2013, Dragos and Dragos, 2014 [citations per publication used as the dependent variable]; Foo and Tan, 2014, He, 2009, Mavros et al., 2013, Rigby, 2013 [adding 1 to citations, dividing by a time normalised value and taking their log]; Tang, 2013 [adding 1 to citations and taking their log]; Stewart, 1983), logistic regression (Baldi, 1998, Bornmann and Williams, 2013, Kutlar et al., 2013, Sin, 2011, Willis et al., 2011, Xia and Nakanishi, 2012, Yu et al., 2014), a distribution-free regression method (Peters & van Raan, 1994), multinomial logistic regression (Baumgartner & Leydesdorff, 2014) and negative binomial regression (Chen, 2012, Didegah and Thelwall, 2013a, Didegah and Thelwall, 2013b, McDonald, 2007, Thelwall and Maflahi, in press [for altmetrics]; Walters, 2006, Yoshikane, 2013 [for patent citations]).

The typical distribution of citations is highly skewed (de Solla Price, 1965, Seglen, 1992), so that tests based upon the normal distribution (e.g., ordinary least squares regression) are not appropriate if the data is raw citation counts. Logistic regression can avoid this issue by predicting membership of the highly cited group of papers rather than directly predicting citations. Whilst negative binomial regression can cope with skewed data and is designed for discrete numbers (Hilbe, 2011), the most appropriate distribution for citations to a collection of articles from a single subject and year seems to be the discrete lognormal distribution (Evans et al., 2012, Radicchi et al., 2008, Thelwall and Wilson, 2014) and the hooked power law is also a reasonable choice (Thelwall & Wilson, 2014). Although many articles suggest a power law for the tail of citation distributions (e.g., Yao, Peng, Zhang, & Xu, 2014), this is not helpful for statistical analyses that need to include all cited articles and is broadly consistent with a lognormal distribution for all articles, although small discrepancies may be revealed by fine-grained analyses (Golosovsky & Solomon, 2012). Citations to articles from individual journals almost always conform to a lognormal distribution (Stringer, Sales-Pardo, & Amaral, 2010), as do some other citation-based indicators (e.g., generalised h-indices: Wu, 2013). Although it has not been fully tested, it seems likely that most sets of articles from a specific time window will approximately follow a discrete lognormal distribution, unless the time window is too long or very recent. Hence it is not clear that negative binomial regression is optimal when the dependent variable is citation counts.
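To make the distribution concrete, the following is a minimal sketch (assuming NumPy; the parameter values mirror those used later in the Results section, but the function name is the author's own) of generating discrete lognormal citation-like counts, i.e. continuous lognormal draws rounded to the nearest integer:

```python
import numpy as np

def discrete_lognormal(n, mu, sigma, rng):
    """Citation-like counts: continuous lognormal draws rounded to the nearest integer."""
    return np.rint(rng.lognormal(mean=mu, sigma=sigma, size=n)).astype(int)

rng = np.random.default_rng(0)
# mu is the mean and sigma the standard deviation of the underlying log values.
counts = discrete_lognormal(10_000, mu=0.5, sigma=1.0, rng=rng)
# The sample is highly right-skewed: most counts are small, a few are very large.
```

The resulting sample reproduces the key features of citation data mentioned above: non-negative integers, many low values (including zeros) and a long right tail.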

Neither the discrete lognormal nor the hooked power law distribution has been used for regression because it seems that no software exists for this. An alternative strategy would be to take the log of the citations and then use the general linear model to analyse them with the assumption that the logged citations would be normally distributed (the general linear model assumes that the error terms or residuals are normally distributed). Although the log of the continuous version of the lognormal distribution is a perfect normal distribution, the same is not true for the discrete lognormal distribution and so it is not obvious that this will work. Moreover, the use of log transformation for citation data has been argued against for classification purposes because of the variance reduction that it introduces (Leydesdorff & Bensman, 2006), but this is not evidence that it will not work for regression. This article assesses both of these and the continuous normal distribution in order to identify the most powerful regression-based approach for citations and similar data, such as altmetrics. The results will help to ensure that future statistical analyses of the factors affecting citation counts are as powerful and reliable as possible.
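The log-transform strategy described above can be sketched as follows (a hedged illustration using SciPy; a two-group comparison stands in for the general linear model, since ANOVA with a single binary factor is equivalent to a pooled-variance t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two samples of citation-like counts drawn from the SAME discrete lognormal
# distribution, i.e. there is no true underlying factor.
a = np.rint(rng.lognormal(0.5, 1.0, 500)).astype(int)
b = np.rint(rng.lognormal(0.5, 1.0, 500)).astype(int)

# Add 1 (so that zero citations survive the log), take the log, then apply the
# general linear model; with one binary factor this reduces to a pooled t-test.
t_stat, p_value = stats.ttest_ind(np.log(a + 1), np.log(b + 1))
```

Because the two groups share a distribution, a significant result here would be a false positive; the question the article addresses is how often each modelling strategy produces such false positives.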

Section snippets

Citation distributions and statistical tests

Statistical regression models attempt to identify the relationship between a dependent variable (citations in this case) and one or more independent variables (e.g., the field in which an article is published or the number of authors). The relationship is typically identified by an algorithm that estimates parameters for the independent variables in order to generate a model that, for any given set of their values, predicts an expected value for the dependent variable. For example, if citations

Research question

In the absence of software that uses a discrete version of the lognormal distribution in the generalised linear model, this article assesses three logical alternatives, driven by the following research question: Which of the following is the most appropriate for discrete lognormal data in the sense that it does not give an inflated chance of false positive results whilst being powerful enough to distinguish minor factors in the data?

  1. Negative binomial regression for the raw data.

  2. Regression with

Methods

In order to address the research question, the three regression approaches must be applied to citation-like data with known relationships. This can be achieved by tests on simulated discrete lognormal data with and without relationships. Although there are infinitely many combinations of dependant variables that could be tested, the simplest case will give the clearest results and so only one relationship will be tested: that of a binary factor so that the two values of the factor correspond to
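The simulation design described, a single binary factor whose two values correspond to different citation distributions, might be sketched as follows (the 0.5 shift in the lognormal location parameter is an illustrative effect size chosen by the present author, not a value taken from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Binary factor with a genuine effect: the `high` group's lognormal location
# parameter is shifted by 0.5 on the log scale (an assumed, illustrative value).
low = np.rint(rng.lognormal(0.5, 1.0, 500)).astype(int)
high = np.rint(rng.lognormal(1.0, 1.0, 500)).astype(int)

# The log(x + 1) plus general linear model strategy should detect this factor.
t_stat, p_value = stats.ttest_ind(np.log(low + 1), np.log(high + 1))
```

Repeating this with and without the shift gives, respectively, an estimate of the power of a method and of its false positive rate, which is the comparison the research question calls for.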

Results

The five approaches were first fitted to discrete lognormal data without any factors using 1000 simulations at each of a range of different sample sizes. As shown in Fig. 1, when the log of the mean is 0.5 and the log of the standard deviation is 1, both ANOVA variants and both continuous lognormal models incorrectly identify factors at approximately the rate corresponding to the level of significance used (0.05), which is the desired behaviour and suggests that the deviation from normality
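The false positive rate check just described can be reproduced in miniature for the log(x + 1) ANOVA variant (a sketch assuming SciPy, with 500 repetitions rather than 1000 to keep it quick; a two-group t-test again stands in for one-factor ANOVA):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, reps = 0.05, 200, 500

# Count how often the log(x + 1) strategy "detects" a factor when none exists.
false_positives = 0
for _ in range(reps):
    a = np.rint(rng.lognormal(0.5, 1.0, n))
    b = np.rint(rng.lognormal(0.5, 1.0, n))  # same distribution: no true factor
    _, p = stats.ttest_ind(np.log(a + 1), np.log(b + 1))
    false_positives += p < alpha

rate = false_positives / reps  # should be close to the nominal alpha of 0.05
```

A well-behaved method yields a rate near the nominal 0.05; a rate well above it is the inflated false positive behaviour that the article warns inappropriate models can produce.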

Discussion and conclusions

A limitation of the tests reported here is that they only deal with the simplest case of a single factor but it seems likely that the same conclusions would be drawn for more complex factors. Moreover, the unreliability of negative binomial regression for citations is probably overestimated by the simulation approach used here since real citation data will be bounded, especially if the data is from a small set of years that are not a long time in the past.

The results show that negative binomial

References (75)

  • G. Abramo et al.

    National research assessment exercises: A comparison of peer review and bibliometrics rankings

    Scientometrics

    (2011)
  • ACUMEN

    Guidelines for good evaluation practice with the ACUMEN Portfolio

    (2014)
  • E. Adie et al.

    Altmetric: Enriching scholarly content with article-level discussion and metrics

    Learned Publishing

    (2013)
  • J. Aitchison et al.

    The multivariate Poisson-log normal distribution

    Biometrika

    (1989)
  • D.W. Aksnes

    Characteristics of highly cited papers

    Research Evaluation

    (2003)
  • D.W. Aksnes et al.

    Are mobile researchers more productive and cited than non-mobile researchers? A large-scale study of Norwegian scientists

    Research Evaluation

    (2013)
  • T.C. Almind et al.

    Informetric analyses on the world wide web: Methodological approaches to ‘webometrics’

    Journal of Documentation

    (1997)
  • D.L. Anderson et al.

    Evaluating research – Peer review team assessment and journal based bibliographic measures: New Zealand PBRF research output scores in 2006

    New Zealand Economic Papers

    (2013)
  • ARC

    Excellence in Research for Australia (ERA)

    (2014)
  • S. Baldi

    Normative versus social constructivist processes in the allocation of citations: A network-analytic model

    American Sociological Review

    (1998)
  • S.E. Baumgartner et al.

    Group-based trajectory modeling (GBTM) of citations in scholarly literature: Dynamic qualities of transient and sticky knowledge claims

    Journal of the Association for Information Science and Technology

    (2014)
  • C. Borgman et al.

    Scholarly communication and bibliometrics

    Annual Review of Information Science and Technology

    (2002)
  • L. Butler et al.

    Evaluating university research performance using metrics

    European Political Science

    (2011)
  • C. Chen

    Predictive effects of structural variation on citation counts

    Journal of the American Society for Information Science and Technology

    (2012)
  • J.R. Cole

    A short history of the use of citations as a measure of the impact of scientific and scholarly work. The Web of Knowledge: A Festschrift in Honor of Eugene Garfield

    (2000)
  • D. de Solla Price

    Networks of scientific papers

    Science

    (1965)
  • F. Didegah et al.

    Determinants of research citation impact in nanoscience and nanotechnology

    Journal of the American Society for Information Science and Technology

    (2013)
  • A.J. Dobson et al.

    An introduction to generalized linear models

    (2008)
  • C.M. Dragos et al.

    Scientific productivity versus efficiency of R&D financing: Bibliometric analysis of African countries

    Current Science

    (2014)
  • T.S. Evans et al.

    Universality of performance indicators based on citation and reference counts

    Scientometrics

    (2012)
  • J.Y.A. Foo et al.

    Analysis and implications of retraction period and coauthorship of fraudulent publications

    Accountability in Research

    (2014)
  • D.G. Garson

    General linear models: Univariate GLM, ANOVA/ANCOVA, repeated measures

    (2012)
  • M. Golosovsky et al.

    Runaway events dominate the heavy tail of citation distributions

    The European Physical Journal – Special Topics

    (2012)
  • Z.L. He

    International collaboration does not have greater epistemic authority

    Journal of the American Society for Information Science and Technology

    (2009)
  • J.M. Hilbe

    Negative binomial regression

    (2011)
  • K. Kousha et al.

    Web impact metrics for research assessment

    (2014)

  • A. Kutlar et al.

    Contributions of Turkish academicians supervising PhD dissertations and their universities to economics: An evaluation of the 1990–2011 period

    Scientometrics

    (2013)