Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1007/11431855_14guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Estimating recall and precision for vague queries in databases

Published: 13 June 2005 Publication History

Abstract

In vague queries, a user enters a value that represents some real world object and expects as the result the set of database values that represent this real world object even with not exact matching. The problem appears in databases that collect data from different sources or databases were different users enter data directly. Query engines usually rely on the use of some type of similarity metric to support data with inexact matching. The problem of building query engines to execute vague queries has been already studied, but an important problem still remains open, namely that of defining the threshold to be used when a similarity scan is performed over a database column. From the bibliography it is known that the threshold depends on the similarity metrics and also on the set of values being queried. Thus, it is unrealistic to expect that the user supplies a threshold at query time. In this paper we propose a process for estimation of recall/precision values for several thresholds for a database column. The idea is that this process is started by a database administrator in a pre-processing phase using samples extracted from database. The meta-data collected by this process may be used in query processing in the optimization phase. The paper describes this process as well as experiments that were performed in order to evaluate it.

References

[1]
Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33 (2001) 31-88.
[2]
Santini, S., Jain, R.: Similarity measures. IEEE Transaction on Pattern Analysis and Machine Intelligence 21 (1999) 871-883.
[3]
Schallehn, E., Sattler, K.U., Saake, G.: Efficient similarity-based operations for data integration. Data Knowl. Eng. 48 (2004) 361-387.
[4]
Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D.: Text joins in an RDBMS for web data integration. In: Proceedings of the Twelfth International Conference on World Wide Web, ACM Press (2003) 90-101.
[5]
Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D., Muthukrishnan, S.: Approximate string joins in a database (almost) for free. In: Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, VLDB 2001, Morgan Kaufmann (2001) 491- 500.
[6]
Schallehn, E., Geist, I., Sattler, K.U.: Supporting similarity operations based on approximate string matching on the web. In Meersman, R., Tari, Z., eds.: CoopIS/DOA/ODBASE (1). Volume 3290 of Lecture Notes in Computer Science., Springer (2004) 227-244.
[7]
Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24 (2001) 19-27.
[8]
Motro, A.: Vague: a user interface to relational databases that permits vague queries. ACM Trans. Inf. Syst. 6 (1988) 187-214.
[9]
Ortega-Binderberger, M.: Integrating Similarity Based Retrieval and Query Refinement in Databases. Phd thesis, UIUC - University of Illinois at Urbana-Champaign, Urbana, Illinois (2002).
[10]
Baeza-Yates, R.A., Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1999).
[11]
Ullman, J.D., Garcia-Molina, H., Widom, J.: Database Systems: The Complete Book. Prentice Hall Inc., Upper Saddle River, New Jersey, USA (2002).
[12]
Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, ACM Press (1998) 34-43.
[13]
Nambiar, U., Kambhampati, S.: Answering imprecise database queries: a novel approach. In: Proceedings of the fifth ACM international workshop on web information and data management, ACM Press (2003) 126-133.
[14]
Dey, D., Sarkar, S.: A probabilistic relational model and algebra. ACM Trans. Database Syst. 21 (1996) 339-369.
[15]
Fuhr, N., Rolleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15 (1997) 32-66.
[16]
Cohen, W.W.: Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, ACM Press (1998) 201-212.
[17]
de Keijzer, A., van Keulen, M.: A possible world approach to uncertain relational data. In: 15th International Workshop on Database and Expert Systems Applications (DEXA 2004) Workshops. SIUFDB-04 1st International Workshop on Supporting Imprecision and Uncertainty in Flexible Databases, Zaragoza, Spain, September 3, 2004, IEEE Computer Society (2004).
[18]
Lakshmanan, L.V.S., Leone, N., Ross, R., Subrahmanian, V.S.: Probview: a flexible probabilistic database system. ACM Trans. Database Syst. 22 (1997) 419-469.
[19]
Fuhr, N.: A probabilistic relational model for the integration of ir and databases. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press (1993) 309-317.
[20]
Barbara, D., Garcia-Molina, H., Porter, D.: A probabilistic relational data model. In: Proceedings of the international conference on extending database technology on Advances in database technology, Springer-Verlag New York, Inc. (1990) 60-74.
[21]
List, J., Mihajlovic, V., de Vries, A.P., Ramirez, G., Hiemstra, D.: The TIJAH XML-IR system at INEX 2003. Proceedings of the 2nd Initiative on the Evaluation of XML Retrieval (INEX 2003), ERCIM Workshop Proceedings (2003).
[22]
Consens, M.P., Milo, T.: Algebras for querying text regions: expressive power and optimization. J. Comput. Syst. Sci. 57 (1998) 272-288.
[23]
Fuhr, N., Grossjohann, K.: XIRQL: a query language for information retrieval in XML documents. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press (2001) 172-180.
[24]
Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, ACM Press (2003) 301-312.
[25]
Ortega, M., Chakrabarti, K., Mehrotra, S.: Efficient evaluation of relevance feedback for multidimensional all-pairs retrieval. In: Proceedings of the 2003 ACM Symposium on Applied computing, SAC 2003, ACM Press (2003) 847-852.
[26]
Chakrabarti, K., Ortega-Binderberger, M., Mehrotra, S., Porkaew, K.: Evaluating refined queries in top-k retrieval systems. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 256-270.
[27]
Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18 (2000) 288-321.
[28]
Cohen, W.W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico, Morgan Kaufmann (2003) 73-78.
[29]
Schallehn, E., Sattler, K.U.: Using similarity-based operations for resolving data-level conflicts. In: BNCOD. (2003) 172-189.
[30]
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In Bernstein, P.A., ed.: Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, May 30 - June 1, ACM (1979) 23-34.
[31]
Guth, G.J.: Surname spellings and computerized record linkage. Historical Methods Newsletter 10 (1976) 10-19.
[32]
Dorneles, C.F., Lima, A.E.N., Heuser, C.A., da Silva, A., Moura, E.: Measuring similarity between collection of values. In: Proceedings of 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), Washington DC, USA, ACM Press (2004) 56-63.
[33]
Hartigan, J.A.: Clustering Algorithms. John Wiley and Sons, Inc., New York, NY, USA (1975).
[34]
Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal 16 (1973) 30-34.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
CAiSE'05: Proceedings of the 17th international conference on Advanced Information Systems Engineering
June 2005
583 pages
ISBN:3540260951
  • Editors:
  • Oscar Pastor,
  • João Falcão e Cunha

Sponsors

  • FCT: Foundation for Science and Technology
  • Universidad Politécnica de Valencia, Spain
  • University of Porto
  • FEUP: Faculdade de Engenharia da Univ. do Porto
  • ERCIM: European Research Consortium for Informatics & Mathematics

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 13 June 2005

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2014)Linking Historical Ship Records to a Newspaper ArchiveSocial Informatics10.1007/978-3-319-15168-7_32(254-263)Online publication date: 10-Nov-2014
  • (2012)On modeling query refinement by capturing user intent through feedbackProceedings of the Twenty-Third Australasian Database Conference - Volume 12410.5555/2483739.2483743(11-20)Online publication date: 31-Jan-2012
  • (2011)Automatic threshold estimation for data matching applicationsInformation Sciences: an International Journal10.1016/j.ins.2010.05.029181:13(2685-2699)Online publication date: 1-Jul-2011
  • (2008)Automatic threshold estimation for data matching applicationsProceedings of the 23rd Brazilian symposium on Databases10.5555/1498932.1498943(106-119)Online publication date: 13-Oct-2008
  • (2008)Uma abordagem efetiva e eficiente para deduplicação de metadados bibliográficos de objetos digitaisProceedings of the 23rd Brazilian symposium on Databases10.5555/1498932.1498941(76-90)Online publication date: 13-Oct-2008
  • (2007)A strategy for allowing meaningful and comparable scores in approximate matchingProceedings of the sixteenth ACM conference on Conference on information and knowledge management10.1145/1321440.1321484(303-312)Online publication date: 6-Nov-2007
  • (2007)XML version detectionProceedings of the 2007 ACM symposium on Document engineering10.1145/1284420.1284441(79-88)Online publication date: 28-Aug-2007
  • (2006)XML fuzzy rankingProceedings of the 7th international conference on Flexible Query Answering Systems10.1007/11766254_14(159-169)Online publication date: 7-Jun-2006

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media