Abstract
We study the problem of efficiently extracting K entities, in a temporal database, which are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and there exist a variety of methods to define the similarity between a record and the search query. However, in temporal databases, each entity is represented as a sequence of historical records. How to properly define the similarity of each entity in the temporal database still remains an open problem. The main challenging is that, when a user issues a search query for an entity, he or she is prone to mix up information of the same entity at different time points. As a result, methods, which are used in relational databases based on record granularity, cannot work any further. Instead, we regard each entity as a set of “virtual records”, where attribute values of a “virtual record” can be from different records of the same entity. In this paper, we propose a novel evaluation model, based on which the similarity between each “virtual record” and the query can be effectively quantified, and the maximum similarity of its “virtual records” is taken as the similarity of an entity. For each entity, as the number of its “virtual records” is exponentially large, calculating the similarity of the entity is challenging. As a result, we further propose a Dominating Tree Algorithm (DTA), which is based on the bounding-pruning-refining strategy, to efficiently extract K entities with greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The encouraging results show that our model for defining the similarity between each entity and the search query is effective, and the proposed DTA can perform at least two orders of magnitude improvement on the performance comparing with the naive approach.
Similar content being viewed by others
References
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB, pp. 586–597 (2002)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
Behm, A., Ji, S., Li, C., Lu, J.: Space-constrained gram-based indexing for efficient approximate string search. In: ICDE, pp. 604–615 (2009)
Benjelloun, O., Garcia-Molina, H., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. Stanford University (2005)
Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Su, Q., Thavisomboon, S., Widom, J.: Generic entity resolution in the SERF project. J. IEEE Data Eng. Bull. 29(2), 13–20 (2006)
Bergamaschi, S., Gelati, G., Guerra, F., Vincini, M.: An intelligent data integration approach for collaborative project management in virtual enterprises. World Wide Web 9(1), 35–61 (2006)
Bilenko, M., Mooney, R.J.: Learning to combine trained distance metrics for duplicate detection in databases. Technical report, University of Texas, Austin (2002)
Brouwer, A.E., Cohen, A.M., Neumaier, A.: Distance-Regular Graphs. Springer, Berlin Heidelberg New York (1989)
Chandel, A., Hassanzadeh, O., Koudas, N., Sadoghi, M., Srivastava, D.: Benchmarking declarative approximate selection predicates. In: SIGMOD, pp. 353–364 (2007)
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE, pp. 5 (2006)
Chaudhuri, S., Chen, B.-C., Ganti, V., Kaushik, R.: Example-driven design of efficient record matching queries. In: VLDB, pp. 327–338 (2007)
Cohen, W.W.: Integration of heterogeneous databases without common domains using queries based on textual similarity. In: SIGMOD, pp. 201–212 (1998)
Date, C.J., Darwen, H., Lorentzos, N.: Temporal Data & the Relational Model. Elsevier’s Science & Technology (2002)
Do, H.-H., Rahm, E.: COMA–a system for flexible combination of schema matching approaches. In: VLDB, pp. 610–621 (2002)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D.: Text joins in an RDBMS for web data integration. In: WWW, pp. 90–101 (2003)
Guha, S., Koudas, N., Marathe, A., Srivastava, D.: Merging the results of approximate match operations. In: VLDB, pp. 636–647 (2004)
Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)
Harary, F.: Graph Theory. Addison-Wesley, Reading (1994)
Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. J. Data Min. Knowl. Discov. 2(1), 9–37 (1998)
Kappel, G., Kapsammeri, E., Retschitzegger, W.: Integrating XML and relational database systems. World Wide Web 7(4), 343–384 (2004)
Koudas, N., Marathe, A., Srivastava, D.: Flexible string matching against large databases in practice. In: VLDB, pp. 1078–1086 (2004)
Li, C., Jin, L., Mehrotra, S.: Supporting efficient record linkage for large data sets using mapping techniques. World Wide Web 9(4), 557–584 (2006)
Li, C., Wang, B., Yang, X.: VGRAM: improving performance of approximate queries on string collections using variable-length grams. In: VLDB, pp. 303–314 (2007)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
On, B.-W., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: ICDE, pp. 496–505 (2007)
Pak, A.N., Chung, C.-W.: A wikipedia matching approach to contextual advertising. World Wide Web 13(3), 251–274 (2010)
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD, pp. 743–754 (2004)
Stonebraker, M.: The design of the postgres storage system. In: VLDB, pp. 289–300 (1987)
Tejada, S., Knoblock, C., Minton, S.: Learning domain-independent string transformation weights for high accuracy object identification. In: SIGKDD, pp. 350–359 (2002)
Turn, P.: Onan extremal problem in graph theory. Journal of Matematiko Fizicki Lapok (in Hungarian) (1941)
Van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworth-Heinemann (1979)
Vernicaand, R., Li, C.: Efficient top-k algorithms for fuzzy search in string collections. In: KEYS, pp. 9 (2009)
Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: SIGMOD, pp. 759–770 (2009)
Winkler, W.E.: The state of record linkage and current research problems. US Bureau of the Census (1999)
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: VLDB, pp. 933–944 (2008)
Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008)
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)
Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD, pp. 353–364 (2008)
Yin, X., Han, J., Yu, P.S.: LinkClus: efficient clustering via heterogeneous semantic links. In: VLDB, pp. 427–438 (2006)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Lu, W., Fung, G.P.C., Du, X. et al. Approximate entity extraction in temporal databases. World Wide Web 14, 157–186 (2011). https://doi.org/10.1007/s11280-011-0109-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11280-011-0109-5