Combining Contents and Citations for Scientific Document Classification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3809)

Abstract

This paper introduces a classification system that exploits both content information and citation structure to classify scientific papers. The system first applies a content-based statistical classification method similar to those used in general text classification. We investigate several classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Of these, K-nearest neighbours outperforms the others, which perform comparably to one another. Using phrases in addition to words, together with a good feature selection strategy such as information gain, can improve system accuracy and reduce training time compared with using words only. To incorporate citation links, the system uses an iterative method that updates the labels of classified instances according to those links. Our results show that combining contents and citations significantly improves system performance.
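The abstract does not give the update rule, so the following is only a minimal sketch of what an iterative citation-based relabelling step could look like, not the authors' method. It assumes each paper already has per-class scores from a content-based classifier, and it mixes those scores with the share of cited papers currently carrying each label. The function name `relabel_with_citations`, the mixing weight `alpha`, and the toy data are all hypothetical.

```python
from collections import Counter


def relabel_with_citations(content_scores, citations, alpha=0.3, max_iters=10):
    """Iteratively update labels using citation links (illustrative only).

    content_scores: dict mapping paper id -> {label: content-based score}
    citations:      dict mapping paper id -> list of linked paper ids
    alpha:          assumed weight given to the citation evidence
    """
    # Start from the content-only decision for every paper.
    labels = {doc: max(scores, key=scores.get)
              for doc, scores in content_scores.items()}

    for _ in range(max_iters):
        new_labels = {}
        for doc, scores in content_scores.items():
            # Count the current labels of the linked papers we know about.
            linked = [n for n in citations.get(doc, []) if n in labels]
            votes = Counter(labels[n] for n in linked)
            total = sum(votes.values())

            combined = {}
            for label, score in scores.items():
                link_share = votes[label] / total if total else 0.0
                combined[label] = (1 - alpha) * score + alpha * link_share
            new_labels[doc] = max(combined, key=combined.get)

        # Stop as soon as a full pass changes nothing.
        if new_labels == labels:
            break
        labels = new_labels

    return labels


# Toy data: paper "c" is borderline on content but cites two "ml" papers.
content_scores = {
    "a": {"ml": 0.9, "db": 0.1},
    "b": {"ml": 0.8, "db": 0.2},
    "c": {"ml": 0.45, "db": 0.55},
}
citations = {"c": ["a", "b"]}
labels = relabel_with_citations(content_scores, citations)
```

In this toy case, content alone would label "c" as "db", while the citation evidence moves it to "ml". Links are followed only in the direction given; a caller who wants cited-by evidence as well would need to add the reverse links.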

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cao, M.D., Gao, X. (2005). Combining Contents and Citations for Scientific Document Classification. In: Zhang, S., Jarvis, R. (eds) AI 2005: Advances in Artificial Intelligence. AI 2005. Lecture Notes in Computer Science, vol 3809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11589990_17

  • DOI: https://doi.org/10.1007/11589990_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30462-3

  • Online ISBN: 978-3-540-31652-7

  • eBook Packages: Computer Science, Computer Science (R0)
