Algorithms

Text

Our NLP (natural language processing) approach differs somewhat from that of the available programmes. Whereas the standard method is to tokenise each sentence, we parse sentences step by step.

The article texts are divided into sentences; nearly 3 billion sentences have been parsed so far. Each sentence is broken into sub-sentences until each sub-sentence contains only one verb. Each sub-sentence is then parsed using the classic subject-verb-object (SVO) pattern.
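As a rough illustration only (this is not the parser used on this website), the idea of splitting a sentence into single-verb pieces and then reading off subject, verb, and object can be sketched as follows. The word lists `VERBS` and `CONJUNCTIONS` and both function names are hypothetical; a real parser would need far more than a lookup table, and this toy only splits at conjunctions, so it does not guarantee one verb per piece in general.

```python
# Hypothetical toy lexicons; a real parser would need a part-of-speech tagger.
VERBS = {"binds", "reduces", "shows", "increases"}
CONJUNCTIONS = {"and", "but", "because", "which", "while"}


def _clean(word):
    """Strip clause punctuation and lowercase a word for lookup."""
    return word.strip(",;").lower()


def split_sub_sentences(sentence):
    """Split at a conjunction whenever a known verb lies on both sides of it."""
    words = sentence.rstrip(".").split()
    pieces, current = [], []
    for i, word in enumerate(words):
        verb_behind = any(_clean(w) in VERBS for w in current)
        verb_ahead = any(_clean(w) in VERBS for w in words[i + 1:])
        if _clean(word) in CONJUNCTIONS and verb_behind and verb_ahead:
            pieces.append(" ".join(current))
            current = []
        else:
            current.append(word.strip(",;"))
    if current:
        pieces.append(" ".join(current))
    return pieces


def parse_svo(sub_sentence):
    """Return (subject, verb, object) around the first known verb, or None."""
    words = sub_sentence.split()
    for i, word in enumerate(words):
        if word.lower() in VERBS:
            return (" ".join(words[:i]), word, " ".join(words[i + 1:]))
    return None


for piece in split_sub_sentences(
    "The enzyme binds the substrate, and the inhibitor reduces the rate."
):
    print(parse_svo(piece))
```

Note that "and" inside a noun phrase (for example "salt and sugar") is left alone, because the split is made only when a verb has already been seen and another one is still to come.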

Authors

Recognising all articles by an author is one of the most difficult tasks of this project. We build a large network of links between authors with similar names. It is not sufficient to match authors whose names are identical, as people tend to write their names in different forms (with or without middle names, first names or initials, accented letters, etc.). We then identify strong links based on various factors, such as:
Self-citations, although one can cite someone else with the same name.
Field of study, as it is very unlikely that someone else with the same name works in the same field.
Same institution/department, although even at the department level two persons may have the same name.
Style of writing, although this has limited applicability, as most articles have more than one author. However, we are able to match papers by the same group based on the choice of words and phrases.

Upon finding several independent links, we establish the connection.
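The following minimal sketch shows one way such a rule could look; it is an assumption-laden illustration, not our production method. The record fields (`id`, `name`, `field`, `institution`, `cites`), the function names, and the threshold `min_links=2` are all hypothetical, names are assumed to be written in "First [Middle] Last" order, and writing style is left out because it cannot be reduced to a simple comparison.

```python
import unicodedata


def name_key(full_name):
    """Reduce a name to (surname, first initial), ignoring accents and case.

    Assumes a non-empty name in "First [Middle] Last" order.
    """
    decomposed = unicodedata.normalize("NFKD", full_name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    parts = ascii_name.replace(".", " ").replace(",", " ").lower().split()
    return (parts[-1], parts[0][0])


def count_links(a, b):
    """Count independent pieces of evidence linking two article records."""
    links = 0
    if a["id"] in b["cites"] or b["id"] in a["cites"]:
        links += 1  # possible self-citation
    if a["field"] == b["field"]:
        links += 1  # same field of study
    if a["institution"] == b["institution"]:
        links += 1  # same institution or department
    return links


def same_author(a, b, min_links=2):
    """Connect two records only if names are compatible and evidence agrees."""
    if name_key(a["name"]) != name_key(b["name"]):
        return False
    return count_links(a, b) >= min_links


paper_1 = {"id": 1, "name": "José A. García", "field": "virology",
           "institution": "Example University", "cites": set()}
paper_2 = {"id": 2, "name": "J. Garcia", "field": "virology",
           "institution": "Another Institute", "cites": {1}}
print(same_author(paper_1, paper_2))
```

Requiring more than one link reflects the point made above: each factor alone can be wrong, so no single factor is trusted on its own.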

ORCID is a universal way to identify authors, but fewer than 5% of the author names are associated with an ORCID. We use it widely because we regard it as the future of scholarly publishing. However, it adds complexity at this stage, as some authors have multiple ORCIDs.
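To illustrate the complication with multiple ORCIDs, here is a small sketch under stated assumptions: article records that have already been connected to each other are grouped with a union-find structure, and the ORCIDs attached to each group are collected. The function names, the record labels, and the placeholder identifiers `"ORCID-A"` and `"ORCID-B"` are hypothetical, and every record named in `links` is assumed to appear in `record_orcid`.

```python
def find(parent, x):
    """Return the representative of x, shortening the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def orcids_per_author(record_orcid, links):
    """Group linked records and collect the ORCIDs seen in each group.

    record_orcid maps a record label to an ORCID string or None.
    links is a list of (record, record) pairs already judged to be one author.
    """
    parent = {record: record for record in record_orcid}
    for a, b in links:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for record, orcid in record_orcid.items():
        group = groups.setdefault(find(parent, record), set())
        if orcid is not None:
            group.add(orcid)
    return list(groups.values())


records = {"p1": "ORCID-A", "p2": None, "p3": "ORCID-B", "p4": None}
for orcids in orcids_per_author(records, [("p1", "p2"), ("p2", "p3")]):
    if len(orcids) > 1:
        print("one author, several ORCIDs:", sorted(orcids))
```

A group with more than one ORCID is either a single author with duplicate identifiers or a wrong link, which is why such cases need extra care.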

Institutions

Affiliations are parsed into known elements, viz. Department, Institution, Address, City, and Country. However, the process is not simple because of non-standard formats. For example, many American journals drop United States from the address, so the country must be deduced from the state. Similarly, many university hospitals do not mention the university name in the address.
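The country deduction mentioned above can be sketched as below. This is a simplified illustration, not our actual parser: it assumes comma-separated affiliations, and the lookup tables `US_STATE_CODES` and `KNOWN_COUNTRIES` are hypothetical and deliberately incomplete.

```python
# Hypothetical, deliberately incomplete lookup tables for illustration.
US_STATE_CODES = {"CA", "MA", "NY", "TX"}
KNOWN_COUNTRIES = {"United States", "United Kingdom", "Germany", "Japan"}
INSTITUTION_WORDS = ("university", "institute", "hospital")


def infer_country(parts):
    """Use the last element: a known country, or a US state code before a ZIP."""
    if not parts:
        return None
    last = parts[-1]
    if last in KNOWN_COUNTRIES:
        return last
    tokens = last.split()
    if tokens and tokens[0] in US_STATE_CODES:
        return "United States"
    return None


def parse_affiliation(affiliation):
    """Split a comma-separated affiliation into a few known elements."""
    parts = [p.strip() for p in affiliation.split(",") if p.strip()]
    department = next(
        (p for p in parts if p.lower().startswith("department")), None
    )
    institution = next(
        (p for p in parts if any(w in p.lower() for w in INSTITUTION_WORDS)),
        None,
    )
    return {
        "department": department,
        "institution": institution,
        "country": infer_country(parts),
    }


print(parse_affiliation(
    "Department of Physics, Example University, Cambridge, MA 02139"
))
```

The example address has no country at all; the sketch recovers it from the state code "MA". The university-hospital case is harder, since the missing university name cannot be read from the address and has to come from an external list.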

Keywords

No author-provided keywords are used anywhere on this website because there is no generally accepted strategy for assigning them. For instance, some authors consider the methods employed as keywords, but some do not. Furthermore, many journals do not even list keywords, whereas some disciplines, such as mathematics and economics, have well-established classification codes. Relying on author-provided keywords would therefore damage the overall integrity of the system across disciplines.

Topics and Subjects

Topics are generalised keywords, but we do not search for the topic names themselves within the text. Instead, we search the whole article (title, abstract, and text) for any common keyword associated with the given topic. Subjects sit one level higher and cover groups of topics.
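A minimal sketch of this lookup, assuming simple substring matching, might look as follows. The tables `TOPIC_KEYWORDS` and `TOPIC_SUBJECT`, their entries, and the function names are invented for illustration and do not reflect our actual topic lists.

```python
# Hypothetical topic and subject tables, for illustration only.
TOPIC_KEYWORDS = {
    "machine learning": {"neural network", "random forest"},
    "electrochemistry": {"cyclic voltammetry", "electrode"},
}
TOPIC_SUBJECT = {
    "machine learning": "computer science",
    "electrochemistry": "chemistry",
}


def find_topics(article):
    """Return topics whose associated keywords occur anywhere in the article."""
    searchable = " ".join(
        article.get(part, "") for part in ("title", "abstract", "text")
    ).lower()
    return {
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(keyword in searchable for keyword in keywords)
    }


def find_subjects(topics):
    """Map each topic to the subject one level above it."""
    return {TOPIC_SUBJECT[topic] for topic in topics}


article = {
    "title": "Predicting electrode degradation",
    "abstract": "A neural network is trained on cycling data.",
    "text": "",
}
topics = find_topics(article)
print(sorted(topics), sorted(find_subjects(topics)))
```

The topic name "machine learning" never appears in the example article; the topic is assigned because an associated keyword does. Plain substring matching is naive (it would also match inside longer words), so this is a starting point rather than a finished matcher.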

Neither topics nor subjects are meant to classify the research objective of an article. Instead, we aim to identify the topics and subjects covered within the text in order to build links between articles.