A framework for abstracting data sources having heterogeneous representation formats

Published: 01 January 2004

Abstract

This paper deals with the issue of abstracting a data source characterized by one among several possible representation formats. First, we show that data source abstraction plays a central role in several important application problems in the area of information system design. Then, we propose a new approach capable of semi-automatically abstracting a data source encoded in one of a variety of formats, such as structured databases, OEM graphs, and XML documents. The capability to handle heterogeneous formats is obtained through the use of a particular conceptual model, called SDR-Network, which is able to uniformly represent and handle data sources with different formats. As a significant application of the presented data source abstraction algorithm, the construction of an Intensional Repository is also illustrated.

    Reviews

    Jonathan P. E. Hodgson

    A major problem in information management is combining data that comes from heterogeneous sources. The issue here is not reconciling information from two sources, important though that may be. Rather, the goal is to embed disparate data representations into a common framework. This paper describes a conceptual model, called a semantic distance and relevance network (SDR network), that can be used to incorporate data from heterogeneous but structured sources.

    The idea is as follows: given data in the form of some structured source, such as a relational database or an Extensible Markup Language (XML) document, one can construct an SDR network. Links between concepts in these networks carry weights that measure semantic distance and relevance. The authors propose an abstraction scheme that can be applied to these networks, in which some nodes and arcs are absorbed into others while the fact of absorption is recorded. This record is needed for any subsequent expansion of the SDR network. The authors suggest that SDR networks be integrated into an "intensional repository" by applying a suitable clustering algorithm based on measuring the similarities between SDR networks. In fact, they claim that one can construct a hierarchy of clusters that would be suitable for browsing the repository.

    The authors illustrate the construction of SDR networks from XML documents and from databases, and give an example of the abstraction process. There are indications of the time and space complexity of the algorithms, but the algorithms do not appear to have been implemented. The ideas in the paper are interesting, and are worth pursuing. Perhaps the best reader for this paper would be someone looking to implement a system based on heterogeneous resources.

    Online Computing Reviews Service
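
To make the absorb-and-record idea in the review concrete, here is a minimal Python sketch. It is not the paper's algorithm or data model: the class `ToyNetwork`, its methods, the weight scales, and the example concept names are all hypothetical, chosen only to illustrate arcs that carry a (semantic distance, relevance) pair and an abstraction step that remembers what it removed so that it can be expanded later.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# (semantic distance, relevance); the numeric scales are an assumption.
Weights = Tuple[float, float]
ArcKey = Tuple[str, str]  # (source concept, target concept)


@dataclass
class ToyNetwork:
    """A toy weighted concept graph; a hypothetical stand-in, not the SDR-Network."""

    arcs: Dict[ArcKey, Weights] = field(default_factory=dict)
    # For each surviving node, a stack of arc sets removed by absorptions into it.
    absorbed: Dict[str, List[Dict[ArcKey, Weights]]] = field(default_factory=dict)

    def add_arc(self, source: str, target: str, distance: float, relevance: float) -> None:
        self.arcs[(source, target)] = (distance, relevance)

    def nodes(self) -> Set[str]:
        """Every concept that still appears on at least one arc."""
        return {name for pair in self.arcs for name in pair}

    def absorb(self, keeper: str, node: str) -> None:
        """Fold `node` into `keeper`, recording every arc that disappears."""
        removed = {pair: w for pair, w in self.arcs.items() if node in pair}
        for pair in removed:
            del self.arcs[pair]
        self.absorbed.setdefault(keeper, []).append(removed)

    def expand(self, keeper: str) -> None:
        """Undo the most recent absorption into `keeper`."""
        self.arcs.update(self.absorbed[keeper].pop())


net = ToyNetwork()
net.add_arc("Customer", "Order", 1.0, 0.9)
net.add_arc("Customer", "Address", 0.0, 1.0)

net.absorb("Customer", "Address")  # "Address" no longer appears in the network
net.expand("Customer")             # the recorded arc is restored
```

Keeping the removed arcs on a per-node stack is one simple way to make abstraction reversible; how the paper actually selects nodes for absorption and updates weights is not described in this page and is not modelled here.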

    Published In

    Data & Knowledge Engineering, Volume 48, Issue 1
    January 2004
    150 pages

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands

    Author Tags

    1. inter-scheme properties
    2. scheme abstraction
    3. semi-structured information sources
    4. source summarization

    Qualifiers

    • Article
