Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Entity integrity management under data volume, variety and veracity

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Edgar Codd introduced the principle of entity integrity in the context of his relational model of data. The principle says that every targeted real-world entity should be uniquely represented in the database. In actual database systems, entity integrity is typically enforced by primary keys. We introduce a framework toward generalizing entity integrity to different dimensions of data, including volume, variety, and veracity. We establish axiomatic and algorithmic foundations for the implication problem of the combined class of uniqueness constraints, functional dependencies, and multivalued dependencies in all combinations of the dimensions we consider. These are based on specific approaches to the semantics of these integrity constraints and to the dimensions of data. We also highlight how our concepts lead to new opportunities for diverse and important areas of applications, such as query optimization, database design and security, and data quality. Overall, this sets out an agenda for future research that extends our approaches or applies different approaches in this area, as driven by application requirements.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  1. Abiteboul S, Hull R, Vianu V (1995) Foundations of databases. Addison-Wesley, Boston

    MATH  Google Scholar 

  2. Arenas M (2006) Normalization theory for XML. SIGMOD Record 35(4):57–64

    Article  Google Scholar 

  3. Arenas M, Libkin L (2005) An information-theoretic approach to normal forms for relational and XML data. J ACM 52(2):246–283

    Article  MathSciNet  MATH  Google Scholar 

  4. Atzeni P, Morfuni N (1986) Functional dependencies and constraints on null values in database relations. Inf Control 70(1):1–31

    Article  MathSciNet  MATH  Google Scholar 

  5. Baazizi MA, Colazzo D, Ghelli G, et al (2019) Schemas and types for JSON data: From theory to practice. In: Proceedings of the 2019 international conference on management of data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp 2060–2063

  6. Balamuralikrishna N, Jiang Y, Koehler H et al (2019) Possibilistic keys. Fuzzy Sets Syst 376:1–36

    Article  MathSciNet  MATH  Google Scholar 

  7. Beeri C (1980) On the membership problem for functional and multivalued dependencies in relational databases. ACM Trans Database Syst 5(3):241–259

    Article  MATH  Google Scholar 

  8. Beeri C, Bernstein PA (1979) Computational problems related to the design of normal form relational schemas. ACM Trans Database Syst 4(1):30–59

    Article  Google Scholar 

  9. Beeri C, Fagin R, Howard JH (1977) A complete axiomatization for FDS and MVDS in database relations. In: SIGMOD Conference. ACM, pp 47–61

  10. Bertossi LE (2011) Database repairing and consistent query answering. Synthesis lectures on data management, Morgan & Claypool Publishers

  11. Biskup J (1995) Achievements of relational database schema design theory revisited. Semantics in Databases, Selected Papers from a Workshop, Prague, Czech Republic vol 1995, pp 29–54

  12. Biskup J (2009) Security in computing systems. Springer, Heidelberg, Germany

    MATH  Google Scholar 

  13. Biskup J, Link S (2011) Appropriate inferences of data dependencies in relational databases. Ann Math Artif Intell 63(3–4):213–255

    Article  MathSciNet  MATH  Google Scholar 

  14. Biskup J, Weibert T (2008) Keeping secrets in incomplete databases. Int J Inf Sec 7(3):199–217

    Article  Google Scholar 

  15. Biskup J, Embley D, Lochner J (2008) Reducing inference control to access control for normalized database schemas. Inf Proc Lett 106(1):8–12

    Article  MathSciNet  MATH  Google Scholar 

  16. Brown P, Link S (2017) Probabilistic keys. IEEE Trans Knowl Data Eng 29(3):670–682

    Article  Google Scholar 

  17. Cali A, Calvanese D, De Giacomo G et al (2004) Data integration under integrity constraints. Inf Syst 29(2):147–163

    Article  MATH  Google Scholar 

  18. Casanova MA, Fagin R, Papadimitriou CH (1984) Inclusion dependencies and their interaction with functional dependencies. J Comput Syst Sci 28(1):29–59

    Article  MathSciNet  MATH  Google Scholar 

  19. Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4–6, 1977, Boulder, Colorado, USA, pp 77–90

  20. Codd EF (1970) A relational model of data for large shared data banks. Commun ACM 13(6):377–387

    Article  MATH  Google Scholar 

  21. Curino C, Moon HJ, Deutsch A et al (2013) Automating the database schema evolution process. VLDB J 22(1):73–98

    Article  Google Scholar 

  22. Delobel C, Adiba M (1985) Relational database systems. North Holland, Amsterdam

    MATH  Google Scholar 

  23. Deutsch A, Popa L, Tannen V (2006) Query reformulation with constraints. SIGMOD Record 35(1):65–73

    Article  Google Scholar 

  24. Dubois D, Prade H (1988) Possibility theory - an approach to computerized processing of uncertainty. Springer, Heidelberg

    MATH  Google Scholar 

  25. Dubois D, Prade H (2001) Possibility theory, probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 32(1–4):35–66

    Article  MathSciNet  MATH  Google Scholar 

  26. Fagin R (1977) Multivalued dependencies and a new normal form for relational databases. ACM Trans Database Syst 2(3):262–278

    Article  MathSciNet  Google Scholar 

  27. Fagin R, Kolaitis P, Miller R et al (2005) Data exchange: semantics and query answering. Theor Comput Sci 336(1):89–124

    Article  MathSciNet  MATH  Google Scholar 

  28. Fan W (2019) Dependencies for graphs: challenges and opportunities. J Data Inform Quality 11(2):1–12

    Article  Google Scholar 

  29. Fan W, Geerts F, Jia X et al (2008) Conditional functional dependencies for capturing data inconsistencies. ACM Trans Database Syst 33(2):1–48

    Article  Google Scholar 

  30. Fan W, Geerts F, Jia X et al (2008) Conditional functional dependencies for capturing data inconsistencies. ACM Trans Database Syst 33(2):1–48

    Article  Google Scholar 

  31. Farkas C, Jajodia S (2002) The inference problem: a survey. SIGKDD Explor 4(2):6–11

    Article  Google Scholar 

  32. Ferrarotti F, Hartmann S, Link S (2013) Efficiency frontiers of XML cardinality constraints. Data Knowl Eng 87:297–319

    Article  Google Scholar 

  33. Gal A (2011) Uncertain schema matching. Synthesis Lectures on Data Management, Morgan & Claypool Publishers

  34. Galil Z (1982) An almost linear-time algorithm for computing a dependency basis in a relational database. J ACM 29(1):96–102

    Article  MathSciNet  MATH  Google Scholar 

  35. Grinberg A (2018) XML and JSON recipes for SQL Server. Synthesis Lectures on Data Management, Apress Berkeley, CA,. https://doi.org/10.1007/978-1-4842-3117-3

  36. Hannula M, Kontinen J (2016) A finite axiomatization of conditional independence and inclusion dependencies. Inf Comput 249:121–137

    Article  MathSciNet  MATH  Google Scholar 

  37. Hannula M, Link S (2018) Automated reasoning about key sets. In: Automated reasoning - 9th international joint conference, IJCAR 2018, Held as Part of the Federated Logic Conference, FloC 2018, Proceedings, pp 47–63

  38. Hartmann S, Link S (2004) Multi-valued dependencies in the presence of lists. In: Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, June 14–16, 2004, Paris, France, pp 330–341

  39. Hartmann S, Link S (2007) Unlocking keys for XML trees. In: Database Theory - ICDT 2007, 11th international conference, Barcelona, Spain, January 10–12, 2007, Proceedings, pp 104–118

  40. Hartmann S, Link S (2009) Efficient reasoning about a robust XML key fragment. ACM Trans Database Syst 34(2):1–33

    Article  Google Scholar 

  41. Hartmann S, Link S (2009b) Expressive, yet tractable XML keys. In: EDBT 2009, 12th international conference on extending database technology, Saint Petersburg, Russia, March 24–26, 2009, Proceedings, pp 357–367

  42. Hartmann S, Link S (2010) Numerical constraints on XML data. Inf Comput 208(5):521–544

    Article  MathSciNet  MATH  Google Scholar 

  43. Hartmann S, Link S (2010b) When data dependencies over SQL tables meet the logics of paradox and \(S\)-3. In: Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PoDS), pp 317–326

  44. Hartmann S, Link S (2012) The implication problem of data dependencies over SQL table definitions: axiomatic, algorithmic and logical characterizations. ACM Trans Database Syst 37(2):1–40

    Article  Google Scholar 

  45. Hartmann S, Köhler H, Link S et al (2008) On the notion of an XML key. In: Semantics in data and knowledge bases, third international workshop, SDKB 2008, Nantes, France, March 29, 2008, Revised Selected Papers, pp 103–112

  46. Hartmann S, Link S, Trinh T (2010) Solving the implication problem for XML functional dependencies with properties. In: Logic, language, information and computation, 17th international workshop, WoLLIC 2010, Brasilia, Brazil, July 6–9, 2010. Proceedings, pp 161–175

  47. Imielinski T, Jr. WL (1983) Incomplete information and dependencies in relational databases. In: SIGMOD’83, proceedings of annual meeting, San Jose, California, USA, May 23–26, 1983, pp 178–184

  48. Johnson DS, Klug AC (1984) Testing containment of conjunctive queries under functional and inclusion dependencies. J Comput Syst Sci 28(1):167–189

    Article  MathSciNet  MATH  Google Scholar 

  49. Klug A, Price R (1982) Determining view dependencies using tableaux. ACM Trans Database Syst 7(3):361–380

    Article  MathSciNet  MATH  Google Scholar 

  50. Köhler H, Link S (2010) Armstrong axioms and Boyce-Codd-Heath normal form under bag semantics. Inf Process Lett 110(16):717–724

    Article  MathSciNet  MATH  Google Scholar 

  51. Köhler H, Link S (2016) Qualitative cleaning of uncertain data. In: Proceedings of the 25th ACM international conference on information and knowledge management, CIKM 2016, Indianapolis, IN, USA, October 24–28, 2016, pp 2269–2274

  52. Köhler H, Link S (2017) Inclusion dependencies and their interaction with functional dependencies in SQL. J Comput Syst Sci 85:104–131

    Article  MathSciNet  MATH  Google Scholar 

  53. Köhler H, Link S (2018) SQL schema design: foundations, normal forms, and normalization. Inf Syst 76:88–113

    Article  Google Scholar 

  54. Köhler H, Link S (2022) Possibilistic data cleaning. IEEE Trans Knowl Data Eng 34(12):5939–5950

    Article  Google Scholar 

  55. Kolahi S (2007) Dependency-preserving normalization of relational and XML data. J Comput Syst Sci 73(4):636–647

    Article  MathSciNet  MATH  Google Scholar 

  56. Kossmann J, Papenbrock T, Naumann F (2022) Data dependencies for query optimization: a survey. VLDB J 31(1):1–22

    Article  Google Scholar 

  57. Levene M, Loizou G (1999) Database design for incomplete relations. ACM Trans Database Syst 24(1):80–125

    Article  MATH  Google Scholar 

  58. Levene M, Loizou G (1999) A guided tour of relational databases and beyond. Springer, Heidelberg

    Book  MATH  Google Scholar 

  59. Levene M, Loizou G (2001) A generalisation of entity and referential integrity in relational databases. ITA 35(2):113–127

    MathSciNet  MATH  Google Scholar 

  60. Levene M, Vincent MW (2000) Justification for inclusion dependency normal form. IEEE Trans Knowl Data Eng 12(2):281–291

    Article  Google Scholar 

  61. Lien E (1982) On the equivalence of database models. J ACM 29(2):333–362

    Article  MATH  Google Scholar 

  62. Link S (2008) On the implication of multivalued dependencies in partial database relations. Int J Found Comput Sci 19(3):691–715

    Article  MathSciNet  MATH  Google Scholar 

  63. Link S (2012) Characterisations of multivalued dependency implication over undetermined universes. J Comput Syst Sci 78(4):1026–1044

    Article  MathSciNet  MATH  Google Scholar 

  64. Link S (2020) Neo4j keys. In: Conceptual modeling - 39th international conference, ER 2020, Vienna, Austria, November 3–6, 2020, Proceedings, pp 19–33

  65. Link S (2022) Object normal form, fourth normal form and their application to database security. In: Conceptual modeling - 41st international conference, ER 2022, Hyderabad, India, October 17–20, 2022, Proceedings, pp 349–364

  66. Link S, Prade H (2016) Possibilistic functional dependencies and their relationship to possibility theory. IEEE Trans Fuzzy Syst 24(3):757–763

    Article  Google Scholar 

  67. Link S, Prade H (2019) Relational database schema design for uncertain data. Inf Syst 84:88–110

    Article  Google Scholar 

  68. Link S (2021) Wei Z (2021) Logical schema design that quantifies update inefficiency and join efficiency. In: Li G, Li Z, Idreos S et al (eds) SIGMOD’21: International conference on management of data. Virtual Event, China, June 20–25, pp 1169–1181

  69. Livshits E, Kimelfeld B, Roy S (2020) Computing optimal repairs for functional dependencies. ACM Trans Database Syst 45(1):1–46

    Article  MathSciNet  Google Scholar 

  70. Mok WY (2016) Utilizing nested normal form to design redundancy free JSON schemas. iJES 4(4):21–25

    Google Scholar 

  71. Naumann F, Herschel M (2010) An introduction to duplicate detection. Synthesis Lectures on Data Management, Morgan & Claypool Publishers

  72. Papadakis G, Ioannou E, Thanos E, et al (2021) The four generations of entity resolution. Synthesis Lectures on Data Management, Morgan & Claypool Publishers

  73. Paredaens J, De Bra P, Gyssens M et al (1989) The Structure of the Relational Database Model. Springer, Heidelberg

    Book  MATH  Google Scholar 

  74. Roblot T, Hannula M, Link S (2018) Probabilistic cardinality constraints - validation, reasoning, and semantic summaries. VLDB J 27(6):771–795

    Article  Google Scholar 

  75. Saha B, Srivastava D (2014) Data quality: The other face of big data. In: IEEE 30th international conference on data engineering, Chicago, ICDE 2014, IL, USA, March 31–April 4, 2014, pp 1294–1297

  76. Skavantzos P, Zhao K, Link S (2021) Uniqueness constraints on property graphs. In: Advanced information systems engineering - 33rd international conference, CAiSE 2021, Melbourne, VIC, Australia, June 28 - July 2, 2021, Proceedings, pp 280–295

  77. Suciu D, Olteanu D, Ré C, et al (2011) Probabilistic databases. Synthesis Lectures on Data Management, Morgan & Claypool Publishers

  78. Thalheim B (1984) A complete axiomatization for full join dependencies in relations. Bulletin of the EATCS 24:109–114

    Google Scholar 

  79. Thalheim B (1989) On semantic issues connected with keys in relational databases permitting null values. Elektron Inform Kybern 25(1/2):11–20

    MathSciNet  Google Scholar 

  80. Thalheim B (1991) Dependencies in relational databases. Teubner, Braunschweig

    Book  MATH  Google Scholar 

  81. Thalheim B (1992) Fundamentals of cardinality constraints. In: ER, pp 7–23

  82. Vincent MW (1997) A corrected 5NF definition for relational database design. Theor Comput Sci 185(2):379–391

    Article  MathSciNet  MATH  Google Scholar 

  83. Vincent MW, Liu J, Liu C (2004) Strong FDs and their application to normal forms in XML. ACM Trans Database Syst 29(3):445–462

    Article  Google Scholar 

  84. Vincent MW, Liu J, Liu C (2004) Strong functional dependencies and their application to normal forms in XML. ACM Trans Database Syst 29(3):445–462

    Article  Google Scholar 

  85. Wei Z, Link S (2019) Embedded functional dependencies and data-completeness tailored database design. PVLDB 12(11):1458–1470

    Google Scholar 

  86. Wei Z, Link S (2019b) A fourth normal form for uncertain data. In: Advanced information systems engineering - 31st international conference, CAiSE 2019, Rome, Italy, June 3–7, 2019, Proceedings, pp 295–311

  87. Wei Z, Link S (2021) Embedded functional dependencies and data-completeness tailored database design. ACM Trans Database Syst 46(2):1–46

    Article  MathSciNet  Google Scholar 

  88. Wei Z, Leck U, Link S (2019) Discovery and ranking of embedded uniqueness constraints. PVLDB 12(13):2339–2352

    Google Scholar 

  89. Wei Z, Leck U, Link S (2019b) Entity integrity, referential integrity, and query optimization with embedded uniqueness constraints. In: 35th IEEE international conference on data engineering, ICDE 2019, Macao, China, April 8–11, 2019, pp 1694–1697

  90. Wu M (1992) The practical need for fourth normal form. In: ACM SIGCSE conference, pp 19–23

  91. Zaniolo C (1980) Mixed transitivity for functional and multivalued dependencies in database relations. Inf Process Lett 10(1):32–34

    Article  MATH  Google Scholar 

  92. Zaniolo C (1984) Database relations with null values. J Comput Syst Sci 28(1):142–166

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sebastian Link.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Link, S. Entity integrity management under data volume, variety and veracity. Knowl Inf Syst 65, 2895–2934 (2023). https://doi.org/10.1007/s10115-022-01814-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-022-01814-1

Keywords

Navigation