Abstract
Functional Dependencies define attribute relationships based on syntactic equality, and when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms defined by an ontology. We study the theoretical foundations of OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Toward enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional Functional Dependencies.
- [1] Ontobee. n.d. The Drug Ontology. Retrieved April 28, 2022 from http://www.ontobee.org/ontology/DRON.Google Scholar
- [2] Kaggle. n.d. Data Science for Good: Kiva Crowdfunding.Retrieved April 28, 2022 from https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding?select=loan_themes_by_region.csv.Google Scholar
- [3] NIH. 2016. Medical Ontology Research. Retrieved April 28, 2022 from https://mor.nlm.nih.gov.Google Scholar
- [4] . 2014. DFD: Efficient functional dependency discovery. In Proceedings of CIKM. 949–958.Google ScholarDigital Library
- [5] . 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). AAAI Press, Menlo Park, CA, 307–328.Google Scholar
- [6] . 2017. Efficient discovery of ontology functional dependencies. In Proceedings of CIKM. 1847–1856.Google ScholarDigital Library
- [7] . 2011. Data cleaning and query answering with matching dependencies and matching functions. In Proceedings of ICDT. 268–279.Google ScholarDigital Library
- [8] . 2013. On the relative trust between inconsistent data and inaccurate constraints. In Proceedings of ICDE. 541–552.Google ScholarDigital Library
- [9] . 2022. The complexity of dependency detection and discovery in relational databases. Theor. Comput. Sci. 900 (2022), 79–96.Google ScholarDigital Library
- [10] . 2005. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of SIGMOD. 143–154.Google ScholarDigital Library
- [11] . 2018. InfoClean: Protecting sensitive information in data cleaning. ACM J. Data. Inf. Qual. 9, 4 (2018), Article 22.Google Scholar
- [12] . 2011. A unified model for data and constraint repair. In Proceedings of ICDE. 446–457.Google ScholarDigital Library
- [13] . 2013. Discovering denial constraints. Proc. VLDB Endow. 6, 13 (2013), 1498–1509.Google ScholarDigital Library
- [14] . 2013. Holistic data cleaning: Putting violations into context. In Proceedings of ICDE. 458–469.Google ScholarDigital Library
- [15] . 2015. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of SIGMOD. 1247–1261.Google ScholarDigital Library
- [16] . 2007. Improving data quality: Consistency and accuracy. In Proceedings of VLDB. 315–326.Google Scholar
- [17] . 2011. Interaction between record matching and data repairing. In Proceedings of SIGMOD. 469–480.Google ScholarDigital Library
- [18] . 1989. Who solved the secretary problem?Statist. Sci. 4, 3 (1989), 294–296.Google Scholar
- [19] . 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139–160.Google ScholarDigital Library
- [20] 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.Google Scholar
- [21] . 2013. The LLUNATIC data-cleaning framework. Proc. VLDB Endow. 6, 9 (2013), 625–636.Google ScholarDigital Library
- [22] . 2017. A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng. 29, 4 (2017), 727–742.Google ScholarDigital Library
- [23] . 2009. LinkedCT: A Linked Data Space for Clinical Trials.
Technical Report CSRG-596. University of Toronto.Google Scholar - [24] . 2018. PACAS: Privacy-aware, data cleaning-as-a-service. In Proceedings ofBig Data. 1023–1030.Google ScholarCross Ref
- [25] . 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of ICDE. 392–401.Google ScholarDigital Library
- [26] . 2018. Towards a benchmark for fact checking with knowledge bases. In Companion Proceedings of the Web Conference 2018. International World Wide Web Conferences Steering Committee, 1595–1598.Google ScholarDigital Library
- [27] . 2009. On approximating optimum repairs for functional dependency violations. In Proceedings of ICDT. 53–62.Google ScholarDigital Library
- [28] . 2009. Metric functional dependencies. In Proceedings of ICDE. 1275–1278.Google ScholarDigital Library
- [29] . 1998. Axiomatisation of functional dependencies in incomplete relations. Theor. Comput. Sci. 206, 1–2 (
Oct. 1998), 283–300.Google ScholarDigital Library - [30] . 1982. On the equivalence of database models. J. ACM 29, 2 (1982), 333–362.Google ScholarDigital Library
- [31] . 2020. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst. 45, 1 (2020), Article 4.Google ScholarDigital Library
- [32] . 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of EDBT. 350–364.Google ScholarCross Ref
- [33] . 1976. The HARPY Speech Recognition System. Ph.D. Dissertation, Carnegie-Mellon University.Google Scholar
- [34] . 2019. Ontology-based entity matching in attributed graphs. Proc. VLDB Endow. 12, 10 (2019), 1195–1207.Google ScholarDigital Library
- [35] . 2020. BARAN: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13, 11 (2020), 1948–1961.Google ScholarDigital Library
- [36] . 2020. The computation of optimal subset repairs. Proc. VLDB Endow. 13, 12 (2020), 2061–2074.Google ScholarDigital Library
- [37] . 2007. Adding integrity constraints to OWL. In Proceedings of OWLED.Google Scholar
- [38] . 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Database Syst. 26, 3 (2001), 344–383.Google ScholarDigital Library
- [39] . 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of ICDT. 189–203.Google ScholarCross Ref
- [40] . 2018. Robust discovery of positive and negative rules in knowledge bases. In Proceedings of ICDE. 1168–1179.Google Scholar
- [41] . 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860–1871.Google ScholarDigital Library
- [42] . 2015. Divide and conquer-based inclusion dependency discovery. Proc. VLDB Endow. 8, 7 (2015), 774–785.Google ScholarDigital Library
- [43] . 2015. Combining quantitative and logical data cleaning. Proc. VLDB Endow. 9, 4 (2015), 300–311.Google ScholarDigital Library
- [44] . 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2, 40 (2000), 99–121.Google Scholar
- [45] . 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10 (2017), 1190–1201.Google ScholarDigital Library
- [46] . 1993. Alternatives to the median absolute deviation. J. Amer. Statist. Assoc. 88, 424 (1993), 1273–1283.Google ScholarCross Ref
- [47] . 2000. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40, 2 (2000), 99–121.Google ScholarDigital Library
- [48] . 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (2017), 721–732.Google ScholarDigital Library
- [49] . 2012. Fundamentals of order dependencies. Proc. VLDB Endow. 5, 11 (2012), 1220–1231.Google ScholarDigital Library
- [50] . 2018. National Drug Code Directory. Retrieved April 28, 2022 from https://www.fda.gov.Google Scholar
- [51] . 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining FDs from relations. In Proceedings of DaWaK. 101–110.Google Scholar
- [52] . 2013. Don’t be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of SIGMOD. 553–564.Google ScholarDigital Library
- [53] . 2008. Mining functional dependencies from data. Data Min. Knowl. Discov. 16, 2 (2008), 197–219.Google ScholarDigital Library
- [54] . 2022. Discovery and contextual data cleaning with ontology functional dependencies. arXiv:2105.08105 (2022).Google Scholar
Index Terms
- Contextual Data Cleaning with Ontology Functional Dependencies
Recommendations
Contextual Data Cleaning with Ontology FDs
SIGMOD '21: Proceedings of the 2021 International Conference on Management of DataFunctional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We motivate the need to include ...
Efficient Discovery of Ontology Functional Dependencies
CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge ManagementFunctional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We enhance dependency-based data ...
Foundational Challenges in Automated Semantic Web Data and Ontology Cleaning
Applying automated reasoning systems to data cleaning in the Semantic Web raises many foundational challenges regarding cleaning-agent design. The authors discuss these challenges and then argue that we can achieve logic trust in the Semantic Web only ...
Comments