ABSTRACT
As the number of published scholarly articles grows steadily each year, new methods are needed to organize scholarly knowledge so that it can be more efficiently discovered and used. Natural Language Processing (NLP) techniques are able to autonomously process scholarly articles at scale and to create machine readable representations of the article content. However, autonomous NLP methods are by far not sufficiently accurate to create a high-quality knowledge graph. Yet quality is crucial for the graph to be useful in practice. We present TinyGenius, a methodology to validate NLP-extracted scholarly knowledge statements using microtasks performed with crowdsourcing. The scholarly context in which the crowd workers operate has multiple challenges. The explainability of the employed NLP methods is crucial to provide context in order to support the decision process of crowd workers. We employed TinyGenius to populate a paper-centric knowledge graph, using five distinct NLP methods. In the end, the resulting knowledge graph serves as a digital library for scholarly articles.
- Rubayyi Alghamdi and Khalid Alfalqi. 2015. A Survey of Topic Modeling in Text Mining. International Journal of Advanced Computer Science and Applications 6 (01 2015). Google ScholarCross Ref
- Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI '17). Association for Computing Machinery, New York, NY, USA, 2334--2346. Google ScholarDigital Library
- Justin Cheng, Jaime Teevan, Shamsi T. Iqbal, and Michael S. Bernstein. 2015. Break It Down: A Comparison of Macro- and Microtasks. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI '15). Association for Computing Machinery, New York, NY, USA, 4061--4064. Google ScholarDigital Library
- Gobinda G. Chowdhury. 2003. Natural language processing. Annual Review of Information Science and Technology 37, 1 (2003), 51--89. Google ScholarCross Ref
- Benjamin M. Good and Andrew I. Su. 2013. Crowdsourcing for bioinformatics. Bioinformatics 29, 16 (06 2013), 1925--1933. Google ScholarCross Ref
- Olaf Hartig. 2017. Foundations of RDF* and SPARQL* : (An Alternative Approach to Statement-Level Metadata in RDF). In Proceedings of the 11th Alberto Mendelzon International Workshop on Foundations of Data Management and the Web 2017 : (CEUR Workshop Proceedings, Vol. 1912). Article 12. http://ceur-ws.org/Vol-1912/paper12.pdfGoogle Scholar
- Mohamad Yaser Jaradeh, Allard Oelen, Kheir Eddine Farfar, Manuel Prinz, Jennifer D'Souza, Gábor Kismihók, Markus Stocker, and Sören Auer. 2019. Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge. K-CAP 2019 - Proceedings of the 10th International Conference on Knowledge Capture (2019), 243--246. Google ScholarDigital Library
- Arif Jinha. 2010. Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing 23, 3 (2010), 258--263. Google ScholarCross Ref
- Steven Komarov, Katharina Reinecke, and Krzysztof Z. Gajos. 2013. Crowd-sourcing Performance Evaluations of User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Paris, France) (CHI '13). Association for Computing Machinery, New York, NY, USA, 207--216. Google ScholarDigital Library
- Ora Lassila, Ralph R Swick, et al. 1998. Resource description framework (RDF) model and syntax specification. (1998).Google Scholar
- Thomas D. LaToza, W. Ben Towne, Christian M. Adriano, and André van der Hoek. 2014. Microtask Programming: Building Software with a Crowd. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (Honolulu, Hawaii, USA) (UIST '14). Association for Computing Machinery, New York, NY, USA, 43--54. Google ScholarDigital Library
- Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2022. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering 34, 1 (2022), 50--70. Google ScholarDigital Library
- Barend Mons and Jan Velterop. 2009. Nano-publication in the e-science era. CEUR Workshop Proceedings 523 (2009).Google Scholar
- Allard Oelen, Markus Stocker, and Sören Auer. 2021. Crowdsourcing Scholarly Discourse Annotations. In 26th International Conference on Intelligent User Interfaces (College Station, TX, USA) (IUI '21). 464--474. Google ScholarDigital Library
- Eric Prudhommeaux and Andy Seaborne. 2008. SPARQL query language for RDF. (2008). http://www.w3.org/TR/rdf-sparql-query/Google Scholar
- Cristina Sarasua, Elena Simperl, and Natalya F. Noy. 2012. CrowdMap: Crowdsourcing Ontology Alignment with Microtasks. In The Semantic Web - ISWC 2012, Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, Jérôme Euzenat, Manfred Hauswirth, Josiane Xavier Parreira, Jim Hendler, Guus Schreiber, Abraham Bernstein, and Eva Blomqvist (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 525--541. Google ScholarDigital Library
- Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2015), 443--460. Google ScholarCross Ref
- Markus Stocker, Pauli Paasonen, Markus Fiebig, Martha A Zaidan, and Alex Hardisty. 2018. Curating Scientific Information in Knowledge Infrastructures. Data Science Journal 17 (2018). Google ScholarCross Ref
- Oguzhan Tas and Farzad Kiyani. 2007. A survey automatic text summarization. PressAcademia Procedia 5, 1 (2007), 205--213. Google ScholarCross Ref
- Jaime Teevan, Daniel J. Liebling, and Walter S. Lasecki. 2014. Selfsourcing Personal Tasks. In CHI '14 Extended Abstracts on Human Factors in Computing Systems (Toronto, Ontario, Canada) (CHI EA '14). 2527--2532. Google ScholarDigital Library
Index Terms
- TinyGenius: intertwining natural language processing with microtask crowdsourcing for scholarly knowledge graph creation
Recommendations
Creating a Scholarly Knowledge Graph from Survey Article Tables
Digital Libraries at Times of Massive Societal TransitionAbstractDue to the lack of structure, scholarly knowledge remains hardly accessible for machines. Scholarly knowledge graphs have been proposed as a solution. Creating such a knowledge graph requires manual effort and domain experts, and is therefore time-...
A Novel Curated Scholarly Graph Connecting Textual and Data Publications
In the last decade, scholarly graphs became fundamental to storing and managing scholarly knowledge in a structured and machine-readable way. Methods and tools for discovery and impact assessment of science rely on such graphs and their quality to serve ...
Pattern-Based Acquisition of Scientific Entities from Scholarly Article Titles
Towards Open and Trustworthy Digital SocietiesAbstractWe describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) noting salient aspects of an article’s ...
Comments