Abstract
Edgar Codd introduced the principle of entity integrity in the context of his relational model of data. The principle says that every targeted real-world entity should be uniquely represented in the database. In actual database systems, entity integrity is typically enforced by primary keys. We introduce a framework toward generalizing entity integrity to different dimensions of data, including volume, variety, and veracity. We establish axiomatic and algorithmic foundations for the implication problem of the combined class of uniqueness constraints, functional dependencies, and multivalued dependencies in all combinations of the dimensions we consider. These are based on specific approaches to the semantics of these integrity constraints and to the dimensions of data. We also highlight how our concepts lead to new opportunities for diverse and important areas of applications, such as query optimization, database design and security, and data quality. Overall, this sets out an agenda for future research that extends our approaches or applies different approaches in this area, as driven by application requirements.
Similar content being viewed by others
References
Abiteboul S, Hull R, Vianu V (1995) Foundations of databases. Addison-Wesley, Boston
Arenas M (2006) Normalization theory for XML. SIGMOD Record 35(4):57–64
Arenas M, Libkin L (2005) An information-theoretic approach to normal forms for relational and XML data. J ACM 52(2):246–283
Atzeni P, Morfuni N (1986) Functional dependencies and constraints on null values in database relations. Inf Control 70(1):1–31
Baazizi MA, Colazzo D, Ghelli G, et al (2019) Schemas and types for JSON data: From theory to practice. In: Proceedings of the 2019 international conference on management of data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp 2060–2063
Balamuralikrishna N, Jiang Y, Koehler H et al (2019) Possibilistic keys. Fuzzy Sets Syst 376:1–36
Beeri C (1980) On the membership problem for functional and multivalued dependencies in relational databases. ACM Trans Database Syst 5(3):241–259
Beeri C, Bernstein PA (1979) Computational problems related to the design of normal form relational schemas. ACM Trans Database Syst 4(1):30–59
Beeri C, Fagin R, Howard JH (1977) A complete axiomatization for FDS and MVDS in database relations. In: SIGMOD Conference. ACM, pp 47–61
Bertossi LE (2011) Database repairing and consistent query answering. Synthesis lectures on data management, Morgan & Claypool Publishers
Biskup J (1995) Achievements of relational database schema design theory revisited. Semantics in Databases, Selected Papers from a Workshop, Prague, Czech Republic vol 1995, pp 29–54
Biskup J (2009) Security in computing systems. Springer, Heidelberg, Germany
Biskup J, Link S (2011) Appropriate inferences of data dependencies in relational databases. Ann Math Artif Intell 63(3–4):213–255
Biskup J, Weibert T (2008) Keeping secrets in incomplete databases. Int J Inf Sec 7(3):199–217
Biskup J, Embley D, Lochner J (2008) Reducing inference control to access control for normalized database schemas. Inf Proc Lett 106(1):8–12
Brown P, Link S (2017) Probabilistic keys. IEEE Trans Knowl Data Eng 29(3):670–682
Cali A, Calvanese D, De Giacomo G et al (2004) Data integration under integrity constraints. Inf Syst 29(2):147–163
Casanova MA, Fagin R, Papadimitriou CH (1984) Inclusion dependencies and their interaction with functional dependencies. J Comput Syst Sci 28(1):29–59
Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4–6, 1977, Boulder, Colorado, USA, pp 77–90
Codd EF (1970) A relational model of data for large shared data banks. Commun ACM 13(6):377–387
Curino C, Moon HJ, Deutsch A et al (2013) Automating the database schema evolution process. VLDB J 22(1):73–98
Delobel C, Adiba M (1985) Relational database systems. North Holland, Amsterdam
Deutsch A, Popa L, Tannen V (2006) Query reformulation with constraints. SIGMOD Record 35(1):65–73
Dubois D, Prade H (1988) Possibility theory - an approach to computerized processing of uncertainty. Springer, Heidelberg
Dubois D, Prade H (2001) Possibility theory, probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 32(1–4):35–66
Fagin R (1977) Multivalued dependencies and a new normal form for relational databases. ACM Trans Database Syst 2(3):262–278
Fagin R, Kolaitis P, Miller R et al (2005) Data exchange: semantics and query answering. Theor Comput Sci 336(1):89–124
Fan W (2019) Dependencies for graphs: challenges and opportunities. J Data Inform Quality 11(2):1–12
Fan W, Geerts F, Jia X et al (2008) Conditional functional dependencies for capturing data inconsistencies. ACM Trans Database Syst 33(2):1–48
Fan W, Geerts F, Jia X et al (2008) Conditional functional dependencies for capturing data inconsistencies. ACM Trans Database Syst 33(2):1–48
Farkas C, Jajodia S (2002) The inference problem: a survey. SIGKDD Explor 4(2):6–11
Ferrarotti F, Hartmann S, Link S (2013) Efficiency frontiers of XML cardinality constraints. Data Knowl Eng 87:297–319
Gal A (2011) Uncertain schema matching. Synthesis Lectures on Data Management, Morgan & Claypool Publishers
Galil Z (1982) An almost linear-time algorithm for computing a dependency basis in a relational database. J ACM 29(1):96–102
Grinberg A (2018) XML and JSON recipes for SQL Server. Synthesis Lectures on Data Management, Apress Berkeley, CA,. https://doi.org/10.1007/978-1-4842-3117-3
Hannula M, Kontinen J (2016) A finite axiomatization of conditional independence and inclusion dependencies. Inf Comput 249:121–137
Hannula M, Link S (2018) Automated reasoning about key sets. In: Automated reasoning - 9th international joint conference, IJCAR 2018, Held as Part of the Federated Logic Conference, FloC 2018, Proceedings, pp 47–63
Hartmann S, Link S (2004) Multi-valued dependencies in the presence of lists. In: Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, June 14–16, 2004, Paris, France, pp 330–341
Hartmann S, Link S (2007) Unlocking keys for XML trees. In: Database Theory - ICDT 2007, 11th international conference, Barcelona, Spain, January 10–12, 2007, Proceedings, pp 104–118
Hartmann S, Link S (2009) Efficient reasoning about a robust XML key fragment. ACM Trans Database Syst 34(2):1–33
Hartmann S, Link S (2009b) Expressive, yet tractable XML keys. In: EDBT 2009, 12th international conference on extending database technology, Saint Petersburg, Russia, March 24–26, 2009, Proceedings, pp 357–367
Hartmann S, Link S (2010) Numerical constraints on XML data. Inf Comput 208(5):521–544
Hartmann S, Link S (2010b) When data dependencies over SQL tables meet the logics of paradox and \(S\)-3. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PoDS), pp 317–326
Hartmann S, Link S (2012) The implication problem of data dependencies over SQL table definitions: axiomatic, algorithmic and logical characterizations. ACM Trans Database Syst 37(2):1–40
Hartmann S, Köhler H, Link S et al (2008) On the notion of an XML key. In: Semantics in data and knowledge bases, third international workshop, SDKB 2008, Nantes, France, March 29, 2008, Revised Selected Papers, pp 103–112
Hartmann S, Link S, Trinh T (2010) Solving the implication problem for XML functional dependencies with properties. In: Logic, language, information and computation, 17th international workshop, WoLLIC 2010, Brasilia, Brazil, July 6–9, 2010. Proceedings, pp 161–175
Imielinski T, Jr. WL (1983) Incomplete information and dependencies in relational databases. In: SIGMOD’83, proceedings of annual meeting, San Jose, California, USA, May 23–26, 1983, pp 178–184
Johnson DS, Klug AC (1984) Testing containment of conjunctive queries under functional and inclusion dependencies. J Comput Syst Sci 28(1):167–189
Klug A, Price R (1982) Determining view dependencies using tableaux. ACM Trans Database Syst 7(3):361–380
Köhler H, Link S (2010) Armstrong axioms and Boyce-Codd-Heath normal form under bag semantics. Inf Process Lett 110(16):717–724
Köhler H, Link S (2016) Qualitative cleaning of uncertain data. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2269–2274
Köhler H, Link S (2017) Inclusion dependencies and their interaction with functional dependencies in SQL. J Comput Syst Sci 85:104–131
Köhler H, Link S (2018) SQL schema design: foundations, normal forms, and normalization. Inf Syst 76:88–113
Köhler H, Link S (2022) Possibilistic data cleaning. IEEE Trans Knowl Data Eng 34(12):5939–5950
Kolahi S (2007) Dependency-preserving normalization of relational and XML data. J Comput Syst Sci 73(4):636–647
Kossmann J, Papenbrock T, Naumann F (2022) Data dependencies for query optimization: a survey. VLDB J 31(1):1–22
Levene M, Loizou G (1999) Database design for incomplete relations. ACM Trans Database Syst 24(1):80–125
Levene M, Loizou G (1999) A guided tour of relational databases and beyond. Springer, Heidelberg
Levene M, Loizou G (2001) A generalisation of entity and referential integrity in relational databases. ITA 35(2):113–127
Levene M, Vincent MW (2000) Justification for inclusion dependency normal form. IEEE Trans Knowl Data Eng 12(2):281–291
Lien E (1982) On the equivalence of database models. J ACM 29(2):333–362
Link S (2008) On the implication of multivalued dependencies in partial database relations. Int J Found Comput Sci 19(3):691–715
Link S (2012) Characterisations of multivalued dependency implication over undetermined universes. J Comput Syst Sci 78(4):1026–1044
Link S (2020) Neo4j keys. In: Conceptual modeling - 39th international conference, ER 2020, Vienna, Austria, November 3–6, 2020, Proceedings, pp 19–33
Link S (2022) Object normal form, fourth normal form and their application to database security. In: Conceptual modeling - 41st international conference, ER 2022, Hyderabad, India, October 17–20, 2022, Proceedings, pp 349–364
Link S, Prade H (2016) Possibilistic functional dependencies and their relationship to possibility theory. IEEE Trans Fuzzy Syst 24(3):757–763
Link S, Prade H (2019) Relational database schema design for uncertain data. Inf Syst 84:88–110
Link S (2021) Wei Z (2021) Logical schema design that quantifies update inefficiency and join efficiency. In: Li G, Li Z, Idreos S et al (eds) SIGMOD’21: International conference on management of data. Virtual Event, China, June 20–25, pp 1169–1181
Livshits E, Kimelfeld B, Roy S (2020) Computing optimal repairs for functional dependencies. ACM Trans Database Syst 45(1):1–46
Mok WY (2016) Utilizing nested normal form to design redundancy free JSON schemas. iJES 4(4):21–25
Naumann F, Herschel M (2010) An introduction to duplicate detection. Synthesis Lectures on Data Management, Morgan & Claypool Publishers
Papadakis G, Ioannou E, Thanos E, et al (2021) The four generations of entity resolution. Synthesis Lectures on Data Management, Morgan & Claypool Publishers
Paredaens J, De Bra P, Gyssens M et al (1989) The Structure of the Relational Database Model. Springer, Heidelberg
Roblot T, Hannula M, Link S (2018) Probabilistic cardinality constraints - validation, reasoning, and semantic summaries. VLDB J 27(6):771–795
Saha B, Srivastava D (2014) Data quality: The other face of big data. In: IEEE 30th international conference on data engineering, Chicago, ICDE 2014, IL, USA, March 31–April 4, 2014, pp 1294–1297
Skavantzos P, Zhao K, Link S (2021) Uniqueness constraints on property graphs. In: Advanced information systems engineering - 33rd international conference, CAiSE 2021, Melbourne, VIC, Australia, June 28 - July 2, 2021, Proceedings, pp 280–295
Suciu D, Olteanu D, Ré C, et al (2011) Probabilistic databases. Synthesis Lectures on Data Management, Morgan & Claypool Publishers
Thalheim B (1984) A complete axiomatization for full join dependencies in relations. Bulletin of the EATCS 24:109–114
Thalheim B (1989) On semantic issues connected with keys in relational databases permitting null values. Elektron Inform Kybern 25(1/2):11–20
Thalheim B (1991) Dependencies in relational databases. Teubner, Braunschweig
Thalheim B (1992) Fundamentals of cardinality constraints. In: ER, pp 7–23
Vincent MW (1997) A corrected 5NF definition for relational database design. Theor Comput Sci 185(2):379–391
Vincent MW, Liu J, Liu C (2004) Strong FDs and their application to normal forms in XML. ACM Trans Database Syst 29(3):445–462
Vincent MW, Liu J, Liu C (2004) Strong functional dependencies and their application to normal forms in XML. ACM Trans Database Syst 29(3):445–462
Wei Z, Link S (2019) Embedded functional dependencies and data-completeness tailored database design. PVLDB 12(11):1458–1470
Wei Z, Link S (2019b) A fourth normal form for uncertain data. In: Advanced information systems engineering - 31st international conference, CAiSE 2019, Rome, Italy, June 3–7, 2019, Proceedings, pp 295–311
Wei Z, Link S (2021) Embedded functional dependencies and data-completeness tailored database design. ACM Trans Database Syst 46(2):1–46
Wei Z, Leck U, Link S (2019) Discovery and ranking of embedded uniqueness constraints. PVLDB 12(13):2339–2352
Wei Z, Leck U, Link S (2019b) Entity integrity, referential integrity, and query optimization with embedded uniqueness constraints. In: 35th IEEE international conference on data engineering, ICDE 2019, Macao, China, April 8–11, 2019, pp 1694–1697
Wu M (1992) The practical need for fourth normal form. In: ACM SIGCSE conference, pp 19–23
Zaniolo C (1980) Mixed transitivity for functional and multivalued dependencies in database relations. Inf Process Lett 10(1):32–34
Zaniolo C (1984) Database relations with null values. J Comput Syst Sci 28(1):142–166
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Link, S. Entity integrity management under data volume, variety and veracity. Knowl Inf Syst 65, 2895–2934 (2023). https://doi.org/10.1007/s10115-022-01814-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-022-01814-1