skip to main content
research-article

Contextual Data Cleaning with Ontology Functional Dependencies

Published:23 May 2022Publication History
Skip Abstract Section

Abstract

Functional Dependencies define attribute relationships based on syntactic equality, and when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms defined by an ontology. We study the theoretical foundations of OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Toward enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional Functional Dependencies.

REFERENCES

  1. [1] Ontobee. n.d. The Drug Ontology. Retrieved April 28, 2022 from http://www.ontobee.org/ontology/DRON.Google ScholarGoogle Scholar
  2. [2] Kaggle. n.d. Data Science for Good: Kiva Crowdfunding.Retrieved April 28, 2022 from https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding?select=loan_themes_by_region.csv.Google ScholarGoogle Scholar
  3. [3] NIH. 2016. Medical Ontology Research. Retrieved April 28, 2022 from https://mor.nlm.nih.gov.Google ScholarGoogle Scholar
  4. [4] Abedjan Z., Schulze P., and Naumann F.. 2014. DFD: Efficient functional dependency discovery. In Proceedings of CIKM. 949958.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. [5] Agrawal R., Mannila H., Srikant R., Toivonen H., and Verkamo A.. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). AAAI Press, Menlo Park, CA, 307–328.Google ScholarGoogle Scholar
  6. [6] Baskaran S., Keller A., Chiang F., Golab L., and Szlichta J.. 2017. Efficient discovery of ontology functional dependencies. In Proceedings of CIKM. 18471856.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. [7] Bertossi L., Kolahi S., and Lakshmanan L.. 2011. Data cleaning and query answering with matching dependencies and matching functions. In Proceedings of ICDT. 268279.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. [8] Beskales G., Ilyas Ihab F., Golab Lukasz, and Galiullin Artur. 2013. On the relative trust between inconsistent data and inaccurate constraints. In Proceedings of ICDE. 541552.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Bläsius T., Friedrich T., and Schirneck M.. 2022. The complexity of dependency detection and discovery in relational databases. Theor. Comput. Sci. 900 (2022), 7996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. [10] Bohannon P., Flaster M., Fan W., and Rastogi R.. 2005. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of SIGMOD. 143154.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Chiang F. and Gairola D.. 2018. InfoClean: Protecting sensitive information in data cleaning. ACM J. Data. Inf. Qual. 9, 4 (2018), Article 22.Google ScholarGoogle Scholar
  12. [12] Chiang F. and Miller R. J.. 2011. A unified model for data and constraint repair. In Proceedings of ICDE. 446457.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. [13] Chu X., Ilyas I., and Papotti P.. 2013. Discovering denial constraints. Proc. VLDB Endow. 6, 13 (2013), 14981509.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] Chu X., Ilyas I., and Papotti P.. 2013. Holistic data cleaning: Putting violations into context. In Proceedings of ICDE. 458469.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. [15] Chu X., Morcos J., Ilyas I., Ouzzani M., Papotti P., Tang N., and Ye Y.. 2015. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of SIGMOD. 12471261.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Cong G., Fan W., Geerts F., Jia X., and Ma S.. 2007. Improving data quality: Consistency and accuracy. In Proceedings of VLDB. 315326.Google ScholarGoogle Scholar
  17. [17] Fan W., Li J., Ma S., Tang N., and Yu W.. 2011. Interaction between record matching and data repairing. In Proceedings of SIGMOD. 469480.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. [18] Ferguson T.. 1989. Who solved the secretary problem?Statist. Sci. 4, 3 (1989), 294296.Google ScholarGoogle Scholar
  19. [19] Flach P. A. and Savnik I.. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139160.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. [20] Garey M. R. and Johnson. D. S.1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.Google ScholarGoogle Scholar
  21. [21] Geerts F., Mecca G., Papotti P., and Santoro D.. 2013. The LLUNATIC data-cleaning framework. Proc. VLDB Endow. 6, 9 (2013), 625636.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Hao S., Tang N., Li G., He J., Ta N., and Feng J.. 2017. A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng. 29, 4 (2017), 727742.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Hassanzadeh O., Kementsietsidis A., Lim L., Miller R. J., and Wang M.. 2009. LinkedCT: A Linked Data Space for Clinical Trials. Technical Report CSRG-596. University of Toronto.Google ScholarGoogle Scholar
  24. [24] Huang Y., Milani M., and Chiang F.. 2018. PACAS: Privacy-aware, data cleaning-as-a-service. In Proceedings ofBig Data. 10231030.Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Huhtala Y., Kinen J., Porkka P., and Toivonen H.. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of ICDE. 392401.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Huynh V. and Papotti P.. 2018. Towards a benchmark for fact checking with knowledge bases. In Companion Proceedings of the Web Conference 2018. International World Wide Web Conferences Steering Committee, 15951598.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Kolahi S. and Lakshmanan L. V. S.. 2009. On approximating optimum repairs for functional dependency violations. In Proceedings of ICDT. 5362.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. [28] Koudas N., Saha A., Srivastava D., and Venkatasubramanian S.. 2009. Metric functional dependencies. In Proceedings of ICDE. 12751278.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. [29] Levene M. and Loizou G.. 1998. Axiomatisation of functional dependencies in incomplete relations. Theor. Comput. Sci. 206, 1–2 (Oct. 1998), 283300.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. [30] Lien Y. E.. 1982. On the equivalence of database models. J. ACM 29, 2 (1982), 333362.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Livshits E., Kimelfeld B., and Roy S.. 2020. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45, 1 (2020), Article 4.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Lopes S., Petit J., and Lakhal L.. 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of EDBT. 350364.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Lowerre B.. 1976. The HARPY Speech Recognition System. Ph.D. Dissertation, Carnegie-Mellon University.Google ScholarGoogle Scholar
  34. [34] Ma H., Alipourlangouri M., Wu Y., Chiang F., and Pi J.. 2019. Ontology-based entity matching in attributed graphs. Proc. VLDB Endow. 12, 10 (2019), 11951207.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Mahdavi M. and Abedjan Z.. 2020. BARAN: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13, 11 (2020), 19481961.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. [36] Miao D., Cai Z., Li J., Gao X., and Liu X.. 2020. The computation of optimal subset repairs. Proc. VLDB Endow. 13, 12 (2020), 20612074.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Motik B., Horrocks I., and Sattler U.. 2007. Adding integrity constraints to OWL. In Proceedings of OWLED.Google ScholarGoogle Scholar
  38. [38] Ng W.. 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Database Syst. 26, 3 (2001), 344383.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Novelli N. and Cicchetti R.. 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of ICDT. 189203.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Ortona S., Meduri V., and Papotti P.. 2018. Robust discovery of positive and negative rules in knowledge bases. In Proceedings of ICDE. 11681179.Google ScholarGoogle Scholar
  41. [41] Papenbrock T., Bergmann T., Finke M., Zwiener J., and Naumann F.. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 18601871.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. [42] Papenbrock T., Kruse S., Quiané-Ruiz J., and Naumann F.. 2015. Divide and conquer-based inclusion dependency discovery. Proc. VLDB Endow. 8, 7 (2015), 774785.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Prokoshyna N., Szlichta J., Chiang F., Miller R., and Srivastava D.. 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300311.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. [44] R. Yossi, T. Carlo, and J. Leonidas. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2, 40 (2000), 99121.Google ScholarGoogle Scholar
  45. [45] Rekatsinas T., Chu X., Ilyas I., and Ré C.. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10 (2017), 11901201.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Rousseeuw P. J. and Croux C.. 1993. Alternatives to the median absolute deviation. J. Amer. Statist. Assoc. 88, 424 (1993), 12731283.Google ScholarGoogle ScholarCross RefCross Ref
  47. [47] Rubner Y., Tomasi C., and Guibas L. J.. 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99121.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. [48] Szlichta J., Godfrey P., Golab L., Kargar M., and Srivastava D.. 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (2017), 721732.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. [49] Szlichta J., Godfrey P., and Gryz J.. 2012. Fundamentals of order dependencies. Proc. VLDB Endow. 5, 11 (2012), 12201231.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. [50] Administration U.S. Food and Drug. 2018. National Drug Code Directory. Retrieved April 28, 2022 from https://www.fda.gov.Google ScholarGoogle Scholar
  51. [51] Wyss C., Giannella C., and Robertson E.. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining FDs from relations. In Proceedings of DaWaK. 101110.Google ScholarGoogle Scholar
  52. [52] Yakout M., Berti-Équille L., and Elmagarmid A.. 2013. Don’t be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of SIGMOD. 553564.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Yao H. and Hamilton H.. 2008. Mining functional dependencies from data. Data Min. Knowl. Discov. 16, 2 (2008), 197219.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. [54] Zheng Z., Zheng L., Langouri M. Alipour, Chiang F., Golab L., and Szlichta J.. 2022. Discovery and contextual data cleaning with ontology functional dependencies. arXiv:2105.08105 (2022).Google ScholarGoogle Scholar

Index Terms

  1. Contextual Data Cleaning with Ontology Functional Dependencies

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Journal of Data and Information Quality
        Journal of Data and Information Quality  Volume 14, Issue 3
        September 2022
        155 pages
        ISSN:1936-1955
        EISSN:1936-1963
        DOI:10.1145/3533272
        Issue’s Table of Contents

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 23 May 2022
        • Online AM: 21 April 2022
        • Accepted: 1 March 2022
        • Received: 1 June 2021
        • Revised: 1 January 2021
        Published in jdiq Volume 14, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Full Text

      View this article in Full Text.

      View Full Text

      HTML Format

      View this article in HTML Format .

      View HTML Format