Skip to main content
Log in

Exploiting edge semantics in citation graphs using efficient, vertical ARM

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Graphs are increasingly becoming a vital source of information within which a great deal of semantics is embedded. As the size of available graphs increases, our ability to arrive at the embedded semantics grows into a much more complicated task. One form of important hidden semantics is that which is embedded in the edges of directed graphs. Citation graphs serve as a good example in this context. This paper attempts to understand temporal aspects in publication trends through citation graphs, by identifying patterns in the subject matters of scientific publications using an efficient, vertical association rule mining model. Such patterns can (a) indicate subject-matter evolutionary history, (b) highlight subject-matter future extensions, and (c) give insights on the potential effects of current research on future research. We highlight our major differences with previous work in the areas of graph mining, citation mining, and Web-structure mining, propose an efficient vertical data representation model, introduce a new subjective interestingness measure for evaluating patterns with a special focus on those patterns that signify strong associations between properties of cited papers and citing papers, and present an efficient algorithm for the purpose of discovering rules of interest followed by a detailed experimental analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Buneman P, Jajodia S (eds) Proceedings of the ACM SIGMOD international conference on management of data, May 1993. Washington, DC, USA. ACM Press, pp 207–216

  2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Bocca JB, Jarke M, Zaniolo C (eds) Proceedings of the 20th VLDB international conference on very large databases, September 1994. Morgan Kaufmann, Santiago, Chile, pp 487–499

  3. Akutsu T, Miyano S, Kuhara S (1999) Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In: Proceedings of the PSB Pacific symposium on biocomputing. Hawaii, USA, January 1999, pp 17–28

  4. AIP (2004) American Institute of Physics. http://www.aip.org, 24 June 2004

  5. Borgelt C (2003) Efficient implementations of Apriori and ECLAT. In: Geothals B, Zaki M (eds) CEUR workshop proceedings of the IEEE ICDM 1st FIMI international workshop frequent itemset mining implementations, 6 June 2004. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-90/

  6. Brown M, Grundy WN, Lin D et al (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Genetics 97(1):262–267

    Google Scholar 

  7. Chakrabarti S, Dom BE, Gibson D et al (1999) Mining the link structure of the World Wide Web. IEEE Comput 32(8):60–67

    Google Scholar 

  8. Ding Q, Ding Qi, Perrizo W (2002a) Association rule mining on remotely sensed images using P-trees. In: Cheng MS, Yu PS, Liu B (eds) Proceedings of the 6th PAKDD Pacific-Asian conference on knowledge discovery and data mining, Taipei, Taiwan, May 2002. Lecture notes in computer science 2336. Springer-Verlag, Berlin Heidelberg New York, pp 66–79

  9. Ding Q, Khan M, Roy A et al (2002b) The P-tree algebra. In: Panda B (ed) Proceedings of the 17th ACM SAC symposium on applied computing, Madrid, Spain, March 2002. ACM Press, pp 413–417

  10. Egghe L, Rousseau R (1990) Introduction to informetrics: quantitative methods in library, documentation and information science. Elsevier, Amsterdam

    Google Scholar 

  11. FIMI (2003) Frequent Itemset Mining Implementations Repository. http://fimi.cs.helsinki.fi, 1 June 2004

  12. Geurts K, Wets G, Brijs T et al (2003) Profiling high frequency accident locations using association rules. In: Proceedings of the 82nd TRB annual meeting of the transportation tesearch board. Washington, DC, USA, January 2003, p 18

  13. Glover F (1994) Tabu search for nonlinear and parametric optimization (with links to genetic algorithms). Discrete Appl Math 49(1–3):231–255

    Article  MathSciNet  MATH  Google Scholar 

  14. Goethals B (2002) Personal Website. http://www.adrem.ua.ac.be/~goethals/, 2 March 2004

  15. Goethals B (2003) Survey on frequent pattern mining. http://www.adrem.ua.ac.be/~goethals/, 30 December 2004

  16. Goethals B (2004) Memory issues in frequent itemset mining. In: Haddad H, Omicini A, Wainwright RL, Liebrock LM (eds) Proceeding of the 19th ACM SAC symposium on applied computing. Nicosia, Cyprus, March 2004. ACM Press, pp 530–534

  17. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Chen W, Naughton JF, Bernstein PA (eds) Proceedings of the ACM SIGMOD international conference on management of data, Dallas, Texas, USA, June 2002. ACM Press, pp 1–12

  18. Heilbroner RL, Galbraith JK (1986) The nature and logic of capitalism. W.W. Norton and Company, New York

    Google Scholar 

  19. IBM (2004) IBM Quest Synthetic Data Generator. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html, 7 April 2004

  20. Inokuchi A, Washio T, Motoda H (2000) An Apriori-based algorithm for mining frequent substructures from graph data. In: Zighed DA, Komorowski HJ, Zytkow JM (eds) Proceedings of the PKDD European conference on principles and practice of knowledge discovery in databases, Lyon, France, September 2000. Lecture notes in computer science 1910. Springer, Berlin Heidelberg New York, pp 13–23

  21. KDD Cup Competition (2003) The 9th ACM SIGKDD CUP international conference on knowledge discovery and data mining. http://www.cs.cornell.edu/projects/kddcup, 30 December 2004

  22. Khan M, Ding Q, Perrizo W (2002) textitk-nearest neighbor classification on spatial data streams using P-trees. In: Cheng MS, Yu PS, Liu B (eds) Proceedings of the 6th PAKDD Pacific-Asian conference on knowledge discovery and data mining. Taipei, Taiwan, May 2002. Lecture notes in computer science 2336. Springer, pp 517–518

  23. Kleinberg JM (1998) Authoritative sources in a hyperlinked environment. In: Proceedings of the ACM-SIAM SODA symposium on discrete algorithms. San Francisco, California, USA, January 1998, pp 668–677

  24. Kosters WA, Pijls W (2003) APRIORI, a depth first implementation. In: Geothals B, Zaki M (eds) CEUR workshop proceedings of the IEEE ICDM 1st FIMI international workshop frequent itemset mining implementations. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-90/, 6 June 2004

  25. Kostoff RN, Del Rio JA, Humenik JA, et al (2001) Citation mining: integrating text mining and bibliometrics for research user profiling. J Am Soc Inform Sci Technol 52(13)

    Google Scholar 

  26. Kuramochi M, Karypis G (2002) An efficient algorithm for discovering frequent subgraphs. Technical Report 02-026, University of Minnesota, Department of Computer Science, http://www.users.cs.umn.edu/~karypis/publications/Papers/PDF/fsg2.pdf, 20 February 2004

  27. Law A, Kelton W (2000) Experimental design, sensitivity analysis, and optimization. In: Simulation modeling and analysis. McGraw-Hill, Singapore, pp 622–668

    Google Scholar 

  28. Liu B, Hsu W (1996) Post-analysis of learned rules. In: Proceedings of the AAAI national conference on artificial intelligence. Portland, Oregon, USA, August 1996. AAAI Press/MIT Press, pp 828–834

  29. Matsuda T, Motoda H, Yoshida T et al (2002) Mining patterns from structured data by beam-wise graph-based induction. In: Lange S, Satoh K, Smith CH (eds) Proceedings of the 5th DS international conference on discovery science. London, UK, November 2002. Lecture notes in computer science 2534, pp 422–429

  30. Osareh F (1996) Bibliometrics, citation analysis and co-citation analysis: a review of literature I. Libri 46(4):149–158

    Article  Google Scholar 

  31. Oyama T, Kitano K, Satou K, et al (2000) Mining association rules related to protein–protein interactions. Genome Inform 11:358–359

    Google Scholar 

  32. Padmanabhan B, Tuzhilin A (1999) Unexpectedness as a measure of interestingness in knowledge discovery. Decision Support Syst 27(3)

  33. Parthasarathy S, Coatney M (2002) Efficient discovery of common substructures in macromolecules. In: Proceedings of the 2nd IEEE ICDM international conference on data mining. IEEE Computer Society 2002. Maebashi City, Japan, December 2002, pp 362–369

  34. Perrizo W (2001) Peano count tree technology lab notes. Technical Report NDSU-CS-TR-01, North Dakota State University, Computer Science Department, http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html, 18 January 2003

  35. Perrizo W, Ding Q, Denton A et al (2003) PINE–-podium incremental neighbor evaluator for spatial data using ptrees. In: Proceedings of the ACM SAC symposium on applied computing. Melbourne, Florida, USA, March 2003. ACM Press, pp 503–508

  36. Perrizo W, Ding Q, Roy A (2001) Deriving high confidence rules from spatial data using peano count trees. In: Wang WS, Yu G, Lu H (eds) Proceedings of the WAIM international conference on web-page information management, Xi'an, China. Lecture notes in computer science 2118. Springer, Berlin Heidelberg New York, pp 91–102.

  37. Pietracaprina A, Zandolin D (2003) Mining frequent itemsets using patricia tries. In: Geothals B, Zaki M (eds) CEUR workshop proceedings of the IEEE ICDM 1st FIMI international workshop on frequent itemset mining implementations. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS//Vol-90/, 6 June 2004

  38. Pipenbacker P, Schliep A, Schneckener S et al (2002) ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 18:182–191

    Article  PubMed  Google Scholar 

  39. Pray L (2003) Unraveling protein–protein interactions. Scientist 17(2). http://www.the-scientist.com/yr2003/jan/lcprofile1_030127.html, 30 December 2004

  40. Rahal I, Perrizo W (2004) An optimized approach for knn text categorization using p-trees. In: Haddad H, Omicini A, Wainwright RL, Liebrock LM (eds) Proceedings of the 19th ACM SAC symposium on applied computing. Nicosia, Cyprus, March 2004. ACM Press, pp 613–617

  41. Serazi M, Perera A, Ding Q, et al (2004) DataMIME™. In: Weikum G, König AC, Deßloch S (eds) Proceedings of the ACM SIGMOD international conference on management of data. Paris, France, June 2004. ACM Press, pp 923–924

  42. Shenoy P, Haristsa JR, Sudatsham S. et al (2000) Turbo-charging vertical mining of large databases. In: Chen W, Naughton JF, Bernstein PA (eds) Proceedings of the ACM SIGMOD international conference on management of data. Dallas, Texas, USA, May 2000. ACM Press, pp 22–29

  43. Silberschatz A, Tuzhilin A (1995) On subjective measures of interestingness in knowledge discovery. In: Fayyad U, Uthurusamy R (eds) Proceedings of the 1st ACM SIGKDD international conference on knowledge discovery and data mining. Montreal, Canada, August 1995. AAAI Press, pp 275–281

  44. Silberschatz A, Tuzhilin A (1996) What makes patterns interesting in knowledge discovery systems. Special issue on data mining. IEEE TKDE Trans Knowledge Data Eng 8(6):970–974

    Article  Google Scholar 

  45. Suzuki E (1997) Autonomous discovery of reliable exception rules. In: Heckerman D, Mannila H, Pregibon D (eds) Proceedings of the 3rd ACM SIGKDD international conference on knowledge discovery and data mining. Newport Beach, California, USA, August 1997. AAAI Press, pp 259–262

  46. TouchGraph LLC (2001) http://www.touchgraph.com [24 November 2004]

  47. UCI (1998) University of California Irvine data repository. http://www.ics.uci.edu/~mlearn/MLRepository.html [30 December 2004]

  48. Wang B, Pan F, Ren D et al (2003) Efficient olap operations for spatial using Peano trees. In: Zaki M, Aggrawal C (eds) Proceedings of the ACM SIGMOD DMKD workshop on data mining and knowledge discovery. San Diego, California, USA, June 2003. ACM Press, pp 28–34

  49. Yahoo! Inc. (2005). http://www.yahoo.com [10 February 2005]

  50. Yan X, Han J (2002) gSpan: graph-based substructure pattern mining. In: Proceedings of the IEEE ICDM international conference on data mining. Maebashi City, Japan, December 2002. IEEE Computer Society, pp 721–724

  51. Zaki M, Gouda K (2003) Fast vertical mining using diffsets. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining. Washington, DC, USA, August 2003. ACM Press, pp 326–335

  52. Zaki M, Parthasarathy S, Ogihara M, et al (1997) New algorithms for fast discovery of association rules. In: Heckerman D, Mannila H, Pregibon D (eds). Proceedings of the 3rd ACM SIGKDD international conference on knowledge discovery and data mining. Newport Beach, California, USA, August 1997. AAAI Press, pp 283–286

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Imad Rahal.

Additional information

Imad Rahal is a newly appointed assistant professor in the Department of Computer Science at the College of Saint Benedict ∣ Saint John's University, Collegeville, MN, and a Ph.D. candidate at North Dakota State University, Fargo, ND. In August 2003, he earned his master's degree in computer science from North Dakota State University. Prior to that, he graduated summa cum laude from the Lebanese American University, Beirut, Lebanon, in February 2001 with a bachelor's degree in computer science. Currently, he is completing the final requirements for his Ph.D. degree in computer science on an NSF ND-EPSCoR doctoral dissertation assistantship with August of 2005 as a projected completion date. He is very active in research, proposal writing, and publications; his research interests are largely in the broad areas of data mining, machine learning, databases, artificial intelligence, and bioinformatics.

Dongmei Ren is working for the Database Technology Institute for z/OS, IBM Silicon Valley Lab, San Jose, CA, as a staff software engineer. She holds a Ph.D. degree from North Dakota State University, Fargo, ND, and master's and bachelor's degrees from TianJin University, TianJin, China. She has been a software engineer at DaTang Telecommunications, Beijing, China. Her areas of expertise are outlier analysis, data mining and knowledge discovery, database systems, machine learning, intelligent systems, wireless networks and bioinformatics. She has been awarded the Siemens Scholarship research enhancement for excellent performance in study and research. She is a member of ACM, IEEE.

Weihua Wu is a network monitoring & managed services analyst at Hewlett-Packard Co. in Canada. He holds a master's degree from North Dakota State University and a bachelor's degree from Nanjing University, both in computer science. His research areas of interest include data mining, knowledge discovery, data warehousing, information technology, network security, and bioinformatics. He has participated in various projects supported by NSF, DARPA, NASA, USDA, and GSA grants.

Anne Denton is an assistant professor in computer science at North Dakota State University. Her research interests are in data mining, knowledge discovery in scientific data, and bioinformatics. Specific interests include data mining of diverse data, in which objects are characterized by a variety of properties such as numerical and categorical attributes, graphs, sequences, time-dependent attributes, and others. She received her Ph.D. in physics from the University of Mainz, Germany, and her M.S. in computer science from North Dakota State University, Fargo, ND.

Christopher Besemann received his M.Sc. in computer science from North Dakota State University in Fargo, ND, 2005. Currently, he works in data mining research topics including association mining and relational data mining with recent work in model integration as a research assistant. He is accepted under a fellowship program for Ph.D. study at North Dakota State University.

William Perrizo is a professor of computer science at North Dakota State University. He holds a Ph.D. degree from the University of Minnesota, a master's degree from the University of Wisconsin and a bachelor's degree from St. John's University. He has been a research scientist at the IBM Advanced Business Systems Division and the U.S. Air Force Electronic Systems Division. His areas of expertise are data mining, knowledge discovery, database systems, distributed database systems, high speed computer and communications networks, precision agriculture and bioinformatics. He is a member of ISCA, ACM, IEEE, IAAA, and AAAS.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rahal, I., Ren, D., Wu, W. et al. Exploiting edge semantics in citation graphs using efficient, vertical ARM. Knowl Inf Syst 10, 57–91 (2006). https://doi.org/10.1007/s10115-005-0229-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-005-0229-2

Keywords

Navigation