Combining Contents and Citations for Scientific Document Classification

Cao, Minh Duc; Gao, Xiaoying

doi:10.1007/11589990_17

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3809)

Abstract

This paper introduces a classification system that exploits both content information and citation structure to classify scientific papers. The system first applies a content-based statistical classification method similar to general text classification. We investigate several classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Among these, K-nearest neighbours outperforms the others, while the rest perform comparably. Using phrases in addition to words, together with a good feature selection strategy such as information gain, improves system accuracy and reduces training time compared with using words only. To incorporate citation links, the system uses an iterative method that updates the labels of classified instances based on those links. Our results show that combining contents and citations significantly improves system performance.
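
The abstract does not spell out the iterative update rule, so the following is only a minimal sketch of one way such a scheme could look, not the authors' method. It assumes each paper already has per-class scores from a content-based classifier, and it repeatedly re-labels every paper by blending that content score with the share of its citation neighbours currently assigned to each class. The function name `refine_labels`, the `weight` parameter, the linear blending rule, and the toy data are all hypothetical.

```python
from collections import Counter


def refine_labels(scores, links, weight=0.5, max_iters=10):
    """Iteratively re-label documents using their citation neighbours.

    scores: dict doc -> dict label -> content-based score (assumed given)
    links:  dict doc -> list of docs it cites or is cited by
    weight: assumed mixing factor between content and citation evidence
    """
    # Start from the content-only decision: the highest-scoring label.
    labels = {doc: max(s, key=s.get) for doc, s in scores.items()}

    for _ in range(max_iters):
        updated = {}
        for doc, content in scores.items():
            # Count the current labels of the linked documents.
            votes = Counter(labels[n] for n in links.get(doc, []) if n in labels)
            total = sum(votes.values())
            combined = {}
            for label, score in content.items():
                link_score = votes[label] / total if total else 0.0
                combined[label] = (1 - weight) * score + weight * link_score
            updated[doc] = max(combined, key=combined.get)
        # Stop once a full pass changes nothing.
        if updated == labels:
            break
        labels = updated
    return labels


# Toy data: paper "c" is borderline on content but linked to two "ML" papers.
scores = {
    "a": {"ML": 0.9, "DB": 0.1},
    "b": {"ML": 0.8, "DB": 0.2},
    "c": {"ML": 0.45, "DB": 0.55},
}
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

print(refine_labels(scores, links))
```

In this toy case the content scores alone would label "c" as "DB", while the citation evidence moves it to "ML". Each pass reads only the labels from the previous pass (a synchronous update), which keeps the result independent of the order in which documents are visited.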

Author information

Authors and Affiliations

School of Mathematics, Statistics & Computer Science, Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand
Minh Duc Cao & Xiaoying Gao

Editor information

Editors and Affiliations

Guangxi Normal University, College of CS and IT, Guilin, China, and University of Technology, Faculty of Engineering and Information Technology, Sydney, Australia
Shichao Zhang
Department of Electrical and Computer Systems Engineering, Monash University, 3800, Melbourne, Victoria, Australia
Ray Jarvis

About this paper

Cite this paper

Cao, M.D., Gao, X. (2005). Combining Contents and Citations for Scientific Document Classification. In: Zhang, S., Jarvis, R. (eds) AI 2005: Advances in Artificial Intelligence. AI 2005. Lecture Notes in Computer Science, vol 3809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11589990_17

DOI: https://doi.org/10.1007/11589990_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30462-3
Online ISBN: 978-3-540-31652-7
eBook Packages: Computer Science, Computer Science (R0)
