Long story short, the entire Exaly project has been developed from scratch, and no external library, resource, or database has been used whatsoever.
Datasets
The entire database has been created by directly analysing the articles published in scholarly journals. The validity of the extracted data has been cross-checked against other resources, but no such resource has been used as a source of data.
Our purpose was to create a fully comprehensive database. There are several projects built on Crossref datasets, since Crossref has a relaxed licence for redistribution. However, there are two major problems. First, Crossref is a major DOI issuer and is thus exclusively focused on the DOI as the identifier. Although most research articles have a DOI, there are still articles without one, and these are not included in Crossref datasets. To make matters worse, as the main issuer of DOIs for scholarly journals, Crossref often ignores articles whose DOIs have been issued by other DOI issuers. Second, the datasets are supplied directly by the publishers, and there are discrepancies in the format of the data.
A significant part of the present project was to curate the data and re-organise the formats for statistical analysis.
Natural Language Processing
Natural Language Processing (NLP) plays a critical role in this project. A native programme was developed, which is substantially different from the available NLP tools. First, we assume the published materials are correctly formatted, particularly with respect to character case. For example, "the" is always treated as the definite article, but "THE" is treated as an acronym unless the entire sentence is in upper case (e.g., capitalised titles or headings). Second, the programme was developed for English grammar and vocabulary and cannot be used for any other language. This is why non-English articles have been specifically identified and excluded from the linguistic analysis.
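As a minimal illustration of the case rule, the check below treats a token as an acronym only when all of its letters are upper case and the surrounding sentence is not itself fully capitalised. It is a simplified sketch, not the actual implementation.

    #include <ctype.h>

    /* Classify a token as an acronym purely by character case.
     * A token counts as an acronym only if every letter is upper case
     * and the surrounding sentence is not itself fully capitalised
     * (e.g. a heading), mirroring the rule described above. */
    static int is_acronym(const char *token, int sentence_all_caps)
    {
        int i, letters = 0;

        if (sentence_all_caps)
            return 0;
        for (i = 0; token[i] != '\0'; i++) {
            if (isalpha((unsigned char)token[i])) {
                letters++;
                if (islower((unsigned char)token[i]))
                    return 0;
            }
        }
        return letters >= 2;   /* "THE" -> acronym; "the", "The" -> ordinary word */
    }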
Whilst the available NLP tools parse sentences into individual tokens, the present system parses sentences into the grammatical structure of Subject-Verb-Object. Since most sentences are more complicated than this simple structure, sub-sentences or sub-phrases associated with the main subject or object are also captured. In the present context, the most important task is to capture the direct objects (and sometimes the subjects) of the sentences, because they often reflect the topics of the article under consideration (i.e., they are equivalent to technical keywords). Adverbs are of particular importance, but we differentiate between adverbs that modify the main verb and those that are part of an object or subject.
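For illustration, the structure below shows one possible shape for the records such a parse might produce; the field names and sizes are purely hypothetical and are not taken from the actual codebase.

    /* Illustrative only: one possible shape for the Subject-Verb-Object
     * records produced by the parser.  A sentence may yield several of
     * these, and each subject/object may carry its own sub-phrases. */
    #define MAX_SUBPHRASES 8

    struct phrase {
        char head[64];                      /* head noun, e.g. "battery"        */
        char modifiers[4][64];              /* adjectives or attached nouns     */
        int  n_modifiers;
    };

    struct svo {
        struct phrase subject;
        char          verb[64];             /* lemmatised main verb             */
        char          adverb[64];           /* adverb modifying the main verb   */
        struct phrase object;               /* direct object: main keyword source */
        struct phrase sub[MAX_SUBPHRASES];  /* sub-phrases tied to subject/object */
        int           n_sub;
    };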
Another major difference is that the available NLP programmes rely solely on the lemmatisation of tokens. In addition to lemmatisation, we rely heavily on morphology. For instance, "formation" is not parsed any further by those NLP programmes, but we divide it into form (verb) + -ation (suffix). Therefore, formation of something, forming something, and to form something are interpreted as semantically synonymous.
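A much-simplified sketch of this morphological step is given below. The suffix table is illustrative only; real derivations need many more rules and exceptions (e.g. restoring the final e so that creation maps back to create), and this is not the actual rule set.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical suffix table: each derivational suffix maps back to a
     * verb stem, so "formation", "forming" and "to form" share one root. */
    struct suffix_rule { const char *suffix; const char *replacement; };

    static const struct suffix_rule rules[] = {
        { "ation", "" },   /* formation   -> form    */
        { "ing",   "" },   /* forming     -> form    */
        { "ment",  "" },   /* development -> develop */
    };

    /* Strip the first matching suffix; returns 1 if a rule applied. */
    static int strip_suffix(const char *word, char *stem, size_t stem_size)
    {
        size_t wlen = strlen(word), i;

        for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
            size_t slen = strlen(rules[i].suffix);
            if (wlen > slen && strcmp(word + wlen - slen, rules[i].suffix) == 0) {
                snprintf(stem, stem_size, "%.*s%s",
                         (int)(wlen - slen), word, rules[i].replacement);
                return 1;
            }
        }
        return 0;
    }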
The ultimate goal is to interpret the meaning so that it is understandable to the machine. However, this is not fully implemented at present. In a preliminary attempt, the probability of the verb is measured using the modifying adverb. For example, in the sentence "this tax policy has been implemented", we measure the probability of implementing the tax policy by the corresponding adverb (e.g., never, often, always). First, we match the unclear antecedent with its corresponding object in the preceding sentence. Let's say the previous sentence is "Milton Friedman proposed negative income tax." The object of the present sentence can therefore be translated from "this tax policy" to "negative income tax". Then, we can convey to the machine that the probability of implementing (the verb) the object negative income tax is 0% (for never), 70% (for often), or 100% (for always). This is incredibly useful when matching sentences from different articles. For instance, we can measure the probability in different situations (e.g., in different countries in the preceding example). Of course, the sentences are not always as simple as the above example, and neither are the algorithms.
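The sketch below shows what such an adverb-to-probability mapping could look like. Only the values for never, often, and always come from the example above; the intermediate adverbs and their values are assumptions added for illustration.

    #include <string.h>

    /* Hypothetical mapping from frequency adverbs to probabilities,
     * following the example above (never = 0 %, often = 70 %, always = 100 %).
     * The intermediate entries are illustrative guesses. */
    struct adverb_prob { const char *adverb; double probability; };

    static const struct adverb_prob adverb_probs[] = {
        { "never",     0.00 },
        { "rarely",    0.10 },
        { "sometimes", 0.50 },
        { "often",     0.70 },
        { "usually",   0.85 },
        { "always",    1.00 },
    };

    /* Returns the probability assigned to the verb, or -1 if the adverb
     * is unknown (the sentence then carries no quantifiable frequency). */
    static double verb_probability(const char *adverb)
    {
        size_t i;
        for (i = 0; i < sizeof(adverb_probs) / sizeof(adverb_probs[0]); i++)
            if (strcmp(adverb, adverb_probs[i].adverb) == 0)
                return adverb_probs[i].probability;
        return -1.0;
    }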
Software
The programme has been chiefly written in C, employing MariaDB as the relational database management system. Many tables have billions of rows, and it is not efficient to process an entire table in one pass. Therefore, a native system was developed to benefit from parallel computing by processing fragments of each table simultaneously.
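The snippet below is a hedged sketch of this idea using the MariaDB C connector: each forked worker processes one primary-key range of a large table. The connection details, the table and column names, and the hard-coded upper bound are placeholders rather than the actual Exaly configuration.

    #include <mysql.h>      /* MariaDB Connector/C */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define WORKERS 8

    /* Hypothetical worker: processes one fragment of a large table by
     * restricting the query to a primary-key range. */
    static void process_fragment(unsigned long lo, unsigned long hi)
    {
        MYSQL *conn = mysql_init(NULL);
        MYSQL_RES *res;
        MYSQL_ROW row;
        char query[256];

        if (conn == NULL ||
            !mysql_real_connect(conn, "localhost", "user", "password",
                                "exaly", 0, NULL, 0))
            exit(1);
        snprintf(query, sizeof(query),
                 "SELECT ArticleID, RawReference FROM ReferenceTable "
                 "WHERE ArticleID >= %lu AND ArticleID < %lu", lo, hi);
        if (mysql_query(conn, query) == 0 && (res = mysql_use_result(conn))) {
            while ((row = mysql_fetch_row(res)) != NULL) {
                /* ... parse row[1] (the raw reference string) here ... */
            }
            mysql_free_result(res);
        }
        mysql_close(conn);
        exit(0);
    }

    int main(void)
    {
        unsigned long max_id = 2000000000UL;   /* illustrative upper bound   */
        unsigned long step = max_id / WORKERS + 1;
        int i;

        for (i = 0; i < WORKERS; i++)          /* one child per fragment     */
            if (fork() == 0)
                process_fragment(i * step, (i + 1) * step);
        while (wait(NULL) > 0)                 /* wait for all workers       */
            ;
        return 0;
    }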
The NLP programme described above was cross-checked against CoreNLP (by the Stanford NLP Group), which is written in Java. Instead of implementing a C API, the test script was written in Java for comparison purposes, but the Java code was not used in production.
No external library has been used, and all the data processing has been designed from scratch. The following section briefly outlines the overall approach and challenges.
Data Processing
Citations
Citations are counted by directly parsing the references. The journal, year, volume, and pages are extracted from each reference. Then, the ArticleID is retrieved from the article database using this information. Author names are ignored here. For books, however, we rely heavily on matching the author names.
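A simplified sketch of this lookup is shown below, assuming the reference has already been parsed and the journal name resolved to a numeric JournalID; the table and column names are hypothetical.

    #include <mysql.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Look up the cited ArticleID from the fields parsed out of a reference.
     * Author names are deliberately ignored at this stage, as described
     * above.  The table and column names are placeholders. */
    static unsigned long find_cited_article(MYSQL *conn, unsigned long journal_id,
                                            int year, int volume, int first_page)
    {
        char query[256];
        MYSQL_RES *res;
        MYSQL_ROW row;
        unsigned long article_id = 0;   /* 0 means the reference was not matched */

        snprintf(query, sizeof(query),
                 "SELECT ArticleID FROM Article "
                 "WHERE JournalID = %lu AND Year = %d "
                 "AND Volume = %d AND FirstPage = %d",
                 journal_id, year, volume, first_page);
        if (mysql_query(conn, query) == 0 && (res = mysql_store_result(conn))) {
            if ((row = mysql_fetch_row(res)) != NULL && row[0] != NULL)
                article_id = strtoul(row[0], NULL, 10);
            mysql_free_result(res);
        }
        return article_id;
    }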
The scientometric factors such as the impact factor, h-index, g-index, L-index, etc. are calculated using the citations.
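For example, the h-index can be computed directly from the per-article citation counts, as in the sketch below: it is the largest h such that h articles have at least h citations each.

    #include <stdlib.h>

    /* qsort comparator: sorts citation counts in descending order */
    static int desc(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (y > x) - (y < x);
    }

    /* h-index: the largest h such that h articles have at least h citations
     * each.  The same sorted array can feed the g-index and related metrics. */
    static int h_index(int *citations, int n)
    {
        int h = 0;

        qsort(citations, n, sizeof(int), desc);
        while (h < n && citations[h] >= h + 1)
            h++;
        return h;
    }

For instance, an author whose articles have been cited 10, 8, 5, 4, and 3 times has an h-index of 4.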
The most challenging step in computing the impact factor, which is currently the most important metric in scholarly publishing, is the detection of citable documents. Each journal publishes various types of documents, from editorials and book reviews to research articles, and they are not always marked by the publishers. To make matters worse, a given document type can have two different meanings in two journals. For instance, Letters to the Editor are research articles in the form of short communications in some journals and opinion pieces by readers in others. Since there is no reliable direct method for determining the article type, we use a combination of various factors, such as the authors and their affiliations, the references, the article title, the submission information (the dates of submission, revision, or acceptance), and the body of the text, to determine whether a document is a peer-reviewed research article.
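Purely for illustration, a classifier of this kind might combine a handful of weak signals into a score, as in the toy sketch below. The features, weights, and threshold are invented; only the idea of combining such factors is taken from the description above.

    /* Hypothetical features extracted for each document; the real system
     * combines many more signals than these. */
    struct doc_features {
        int has_affiliations;      /* authors list institutional affiliations */
        int n_references;          /* size of the reference list              */
        int has_submission_dates;  /* received / revised / accepted dates     */
        int body_length;           /* length of the full text, in words       */
    };

    /* Toy scoring rule with made-up weights and threshold: returns 1 if the
     * document looks like a peer-reviewed research article. */
    static int is_citable(const struct doc_features *f)
    {
        int score = 0;

        score += f->has_affiliations     ? 2 : 0;
        score += f->n_references >= 10   ? 2 : 0;
        score += f->has_submission_dates ? 3 : 0;
        score += f->body_length >= 2000  ? 2 : 0;
        return score >= 6;
    }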
Owing to the fact that the impact factor is calculated directly by dividing the number of citations by the number of citable articles, not excluding the non-citable documents can massively reduce the impact factor of a journal.
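As a hedged numerical illustration with made-up figures:

    impact factor = citations received / citable documents published

    5,000 citations / 1,000 research articles             = 5.0
    5,000 citations / (1,000 articles + 250 editorials)   = 4.0

The citations are identical in both cases; merely counting the editorials as citable documents deflates the reported value by a full point.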
Authors
Nearly 5 million authors, associated with about 90 million articles, were recognised.
In addition to the common problem of names being written in different forms (initials instead of first names, dropped middle names, etc.), different versions of accented names, which are very common among European names, had to be matched. Moreover, some regions (mostly the Far East) use different character encodings. Therefore, it was necessary to detect the character encoding of each entry and convert it to UTF-8 where necessary.
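Once the encoding of an entry has been detected, the conversion itself is straightforward. The sketch below uses the standard iconv interface and assumes the source encoding is already known; encoding detection is a separate step, and this is an illustration rather than the actual routine.

    #include <iconv.h>
    #include <string.h>

    /* Convert a single entry to UTF-8 once its source encoding is known.
     * The caller passes the detected name (e.g. "GBK" or "Shift_JIS")
     * as src_encoding. */
    static int to_utf8(const char *src_encoding, const char *in,
                       char *out, size_t out_size)
    {
        iconv_t cd = iconv_open("UTF-8", src_encoding);
        char *inbuf = (char *)in;          /* iconv's API is not const-correct */
        char *outbuf = out;
        size_t inleft = strlen(in), outleft = out_size - 1;

        if (cd == (iconv_t)-1)
            return -1;
        if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
            iconv_close(cd);
            return -1;
        }
        *outbuf = '\0';
        iconv_close(cd);
        return 0;
    }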
Author matching was conducted by measuring a match probability from various factors, none of which is sufficient on its own for a reliable match (a toy combination is sketched after the list):
• Affiliation: Matching the countries and institutions was used as a probability factor, but unfortunately, these are not reliable as key parameters: if a name is common, it will also be common in the country where the institution is located. However, affiliation can be fairly reliable at the department level, as it is not very common for two faculty members in the same department to have exactly the same name. Extracting the institutions and departments is described in the next section.
• Citations: Authors tend to cite themselves. Matching the names with the citing authors can provide a large network in which many connections are correct matches.
• Co-Authors: It is not very common for an author to work with two different co-authors who have exactly the same name.
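As a toy illustration of how such weak signals might be combined, the sketch below multiplies the per-factor probabilities under an independence assumption. The combination rule, the field names, and the threshold are illustrative assumptions, not the actual matching algorithm.

    /* Evidence collected for a candidate pair of author records; each field
     * is an estimated probability that the two records are the same person,
     * judged from that factor alone. */
    struct match_evidence {
        double p_affiliation;    /* same department / institution / country */
        double p_self_citation;  /* one record cites the other's papers     */
        double p_coauthors;      /* overlapping co-author names             */
    };

    /* Toy combination rule assuming the factors are independent:
     * P(no match) = product of (1 - p_i).  A real system would weight
     * the factors and calibrate the threshold empirically. */
    static int same_author(const struct match_evidence *e, double threshold)
    {
        double p_no_match = (1.0 - e->p_affiliation)
                          * (1.0 - e->p_self_citation)
                          * (1.0 - e->p_coauthors);
        return (1.0 - p_no_match) >= threshold;
    }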
Institutions and Departments
Over 60,000 institutions were specifically recognised from nearly 600 million entries of authors' affiliations.
Since most of them are higher education institutions, the academic departments were also captured.
Parsing the authors' affiliations is quite challenging because of the variations in the format.
First, there might be no separators between the components of the full address, for example, "Department of Psychological Medicine Imperial College London Claybrook Centre Charing Cross Campus London UK". Therefore, we had to capture the named entities and determine their function within the address.
Second, no article information is available for old articles. The publishers simply scanned the hardcopies and re-typed the article titles for the sake of web presence; that is why a preview of the first page is available in PDF format on the publishers' websites. We had to parse the PDF files to extract the author names and affiliations.
Third, similar to author names, institutions are written in different forms, which have to be matched. For instance, "LSE", "London School of Economics", and "London School of Economics and Political Science" all refer to the same institution.
Fourth, although we are not primarily interested in cities, we have to identify them to resolve ambiguity in common names. This is of utmost importance for university systems such as the University of California: we need the city to differentiate between the University of California, Berkeley and the University of California, Los Angeles, though the latter is often written as UCLA.
Fifth, there is ambiguity among geographical locations too. For example, California, Calif., and CA are used synonymously, but the latter may also mean Canada. In this case, we use the other components of the address to make the distinction: it is not very likely that California and Canada both have a major city with the same name, and any mention of a Canadian province tips the distinction towards the latter.
Sixth, institutions (particularly European universities) can be written in different languages using the same Latin script. For example, Technische Universität München and Technical University of Munich are the same institution, and both versions are used in authors' affiliations.
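A minimal sketch of the matching idea is given below: every known written form is folded to a lower-case key and mapped to one canonical InstitutionID. The alias table, the identifiers, and the assumption that diacritics have already been folded by the earlier UTF-8 normalisation are all illustrative.

    #include <ctype.h>
    #include <string.h>

    /* Hypothetical alias table: every known written form of an institution
     * points to one canonical InstitutionID. */
    struct alias { const char *name; unsigned long institution_id; };

    static const struct alias aliases[] = {
        { "lse",                                               101 },
        { "london school of economics",                        101 },
        { "london school of economics and political science",  101 },
        { "technische universitat munchen",                    202 },
        { "technical university of munich",                    202 },
    };

    /* Lower-case the candidate (diacritics assumed already folded) and
     * look it up in the alias table. */
    static unsigned long resolve_institution(const char *raw)
    {
        char key[128];
        size_t i, j = 0;

        for (i = 0; raw[i] != '\0' && j < sizeof(key) - 1; i++)
            key[j++] = (char)tolower((unsigned char)raw[i]);
        key[j] = '\0';
        for (i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++)
            if (strcmp(key, aliases[i].name) == 0)
                return aliases[i].institution_id;
        return 0;   /* unknown: fall back to fuzzy matching or manual curation */
    }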
Topics
Detecting the research topics of each article is based on the Natural Language Processing system described above. Every sentence of the article's full text is parsed into the Subject-Verb-Object structure (a sentence may contain several such structures). The purpose is to capture subjects and objects, because a keyword is always a subject or an object in the sentence. Then, semantically synonymous keywords are grouped together. For example, lithium-ion battery, Li-ion battery, Li-ion cell, lithium rechargeable battery, lithium secondary cell, and so forth refer to the same topic. The tokens are, of course, normalised to account for variants of the words such as American vs British spelling, singular vs plural, etc.
However, mentioning a keyword does not guarantee coverage of the corresponding topic, as a research article normally mentions various examples that are related to the topic under consideration but are not themselves the subject of investigation. Therefore, keywords are weighted based on their frequency, their place of appearance in the text, and their role in the sentence. For the latter, the semantics of the verb and adverbs are also considered.
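Purely as an illustration of this weighting, the sketch below scores a single keyword occurrence from its place of appearance and its grammatical role. The numeric weights are invented; only the three factors themselves (frequency, place of appearance, role in the sentence) come from the description above.

    /* Where in the article a keyword occurrence was found. */
    enum section { SEC_TITLE, SEC_ABSTRACT, SEC_BODY, SEC_REFERENCES };

    /* Grammatical role of the keyword within its sentence. */
    enum role { ROLE_OBJECT, ROLE_SUBJECT, ROLE_OTHER };

    /* Toy weighting of a single keyword occurrence with made-up weights. */
    static double occurrence_weight(enum section sec, enum role r)
    {
        static const double section_w[] = { 5.0, 3.0, 1.0, 0.5 };
        static const double role_w[]    = { 2.0, 1.5, 1.0 };

        return section_w[sec] * role_w[r];
    }

    /* The total weight of a keyword is then the sum of occurrence_weight()
     * over all of its occurrences, which also accounts for frequency. */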
In the next stage, the topics of each article are matched with the topics of its references and of the articles that cite it. Through this references-article-citations network, the common topics are detected. Particular attention is given to the topics of the citing articles because they capture another phenomenon: an article can influence research in other fields.
A keyword always appears within the framework of a subject or an object, but the subject or object is not always the keyword in its entirety. For instance, various adjectives may be used that are not necessarily (or, at least, practically) part of the intended keyword. In best lithium-ion battery, the adjective best is not very useful for creating a new keyword.