Article

Estimating recall and precision for vague queries in databases

Authors:

Raquel Kolitski Stasiu,

Carlos A. Heuser,

Roberto da SilvaAuthors Info & Claims

CAiSE'05: Proceedings of the 17th international conference on Advanced Information Systems Engineering

Pages 187 - 200

https://doi.org/10.1007/11431855_14

Published: 13 June 2005 Publication History

Abstract

In vague queries, a user enters a value that represents some real world object and expects as the result the set of database values that represent this real world object even with not exact matching. The problem appears in databases that collect data from different sources or databases were different users enter data directly. Query engines usually rely on the use of some type of similarity metric to support data with inexact matching. The problem of building query engines to execute vague queries has been already studied, but an important problem still remains open, namely that of defining the threshold to be used when a similarity scan is performed over a database column. From the bibliography it is known that the threshold depends on the similarity metrics and also on the set of values being queried. Thus, it is unrealistic to expect that the user supplies a threshold at query time. In this paper we propose a process for estimation of recall/precision values for several thresholds for a database column. The idea is that this process is started by a database administrator in a pre-processing phase using samples extracted from database. The meta-data collected by this process may be used in query processing in the optimization phase. The paper describes this process as well as experiments that were performed in order to evaluate it.

References

[1]

Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33 (2001) 31-88.

Digital Library

[2]

Santini, S., Jain, R.: Similarity measures. IEEE Transaction on Pattern Analysis and Machine Intelligence 21 (1999) 871-883.

Digital Library

[3]

Schallehn, E., Sattler, K.U., Saake, G.: Efficient similarity-based operations for data integration. Data Knowl. Eng. 48 (2004) 361-387.

Digital Library

[4]

Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D.: Text joins in an RDBMS for web data integration. In: Proceedings of the Twelfth International Conference on World Wide Web, ACM Press (2003) 90-101.

Digital Library

[5]

Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D., Muthukrishnan, S.: Approximate string joins in a database (almost) for free. In: Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, VLDB 2001, Morgan Kaufmann (2001) 491- 500.

Digital Library

[6]

Schallehn, E., Geist, I., Sattler, K.U.: Supporting similarity operations based on approximate string matching on the web. In Meersman, R., Tari, Z., eds.: CoopIS/DOA/ODBASE (1). Volume 3290 of Lecture Notes in Computer Science., Springer (2004) 227-244.

[7]

Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Engineering Bulletin 24 (2001) 19-27.

[8]

Motro, A.: Vague: a user interface to relational databases that permits vague queries. ACM Trans. Inf. Syst. 6 (1988) 187-214.

Digital Library

[9]

Ortega-Binderberger, M.: Integrating Similarity Based Retrieval and Query Refinement in Databases. Phd thesis, UIUC - University of Illinois at Urbana-Champaign, Urbana, Illinois (2002).

Digital Library

[10]

Baeza-Yates, R.A., Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1999).

Digital Library

[11]

Ullman, J.D., Garcia-Molina, H., Widom, J.: Database Systems: The Complete Book. Prentice Hall Inc., Upper Saddle River, New Jersey, USA (2002).

Digital Library

[12]

Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, ACM Press (1998) 34-43.

Digital Library

[13]

Nambiar, U., Kambhampati, S.: Answering imprecise database queries: a novel approach. In: Proceedings of the fifth ACM international workshop on web information and data management, ACM Press (2003) 126-133.

Digital Library

[14]

Dey, D., Sarkar, S.: A probabilistic relational model and algebra. ACM Trans. Database Syst. 21 (1996) 339-369.

Digital Library

[15]

Fuhr, N., Rolleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15 (1997) 32-66.

Digital Library

[16]

Cohen, W.W.: Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, ACM Press (1998) 201-212.

Digital Library

[17]

de Keijzer, A., van Keulen, M.: A possible world approach to uncertain relational data. In: 15th International Workshop on Database and Expert Systems Applications (DEXA 2004) Workshops. SIUFDB-04 1st International Workshop on Supporting Imprecision and Uncertainty in Flexible Databases, Zaragoza, Spain, September 3, 2004, IEEE Computer Society (2004).

Digital Library

[18]

Lakshmanan, L.V.S., Leone, N., Ross, R., Subrahmanian, V.S.: Probview: a flexible probabilistic database system. ACM Trans. Database Syst. 22 (1997) 419-469.

Digital Library

[19]

Fuhr, N.: A probabilistic relational model for the integration of ir and databases. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press (1993) 309-317.

Digital Library

[20]

Barbara, D., Garcia-Molina, H., Porter, D.: A probabilistic relational data model. In: Proceedings of the international conference on extending database technology on Advances in database technology, Springer-Verlag New York, Inc. (1990) 60-74.

Digital Library

[21]

List, J., Mihajlovic, V., de Vries, A.P., Ramirez, G., Hiemstra, D.: The TIJAH XML-IR system at INEX 2003. Proceedings of the 2nd Initiative on the Evaluation of XML Retrieval (INEX 2003), ERCIM Workshop Proceedings (2003).

[22]

Consens, M.P., Milo, T.: Algebras for querying text regions: expressive power and optimization. J. Comput. Syst. Sci. 57 (1998) 272-288.

Digital Library

[23]

Fuhr, N., Grossjohann, K.: XIRQL: a query language for information retrieval in XML documents. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press (2001) 172-180.

Digital Library

[24]

Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, ACM Press (2003) 301-312.

Digital Library

[25]

Ortega, M., Chakrabarti, K., Mehrotra, S.: Efficient evaluation of relevance feedback for multidimensional all-pairs retrieval. In: Proceedings of the 2003 ACM Symposium on Applied computing, SAC 2003, ACM Press (2003) 847-852.

Digital Library

[26]

Chakrabarti, K., Ortega-Binderberger, M., Mehrotra, S., Porkaew, K.: Evaluating refined queries in top-k retrieval systems. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 256-270.

Digital Library

[27]

Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18 (2000) 288-321.

Digital Library

[28]

Cohen, W.W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico, Morgan Kaufmann (2003) 73-78.

[29]

Schallehn, E., Sattler, K.U.: Using similarity-based operations for resolving data-level conflicts. In: BNCOD. (2003) 172-189.

Digital Library

[30]

Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In Bernstein, P.A., ed.: Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, May 30 - June 1, ACM (1979) 23-34.

Digital Library

[31]

Guth, G.J.: Surname spellings and computerized record linkage. Historical Methods Newsletter 10 (1976) 10-19.

[32]

Dorneles, C.F., Lima, A.E.N., Heuser, C.A., da Silva, A., Moura, E.: Measuring similarity between collection of values. In: Proceedings of 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), Washington DC, USA, ACM Press (2004) 56-63.

Digital Library

[33]

Hartigan, J.A.: Clustering Algorithms. John Wiley and Sons, Inc., New York, NY, USA (1975).

Digital Library

[34]

Sibson, R.: SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal 16 (1973) 30-34.

Cited By

Bravo Balado Ade Boer VSchreiber G(2014)Linking Historical Ship Records to a Newspaper ArchiveSocial Informatics10.1007/978-3-319-15168-7_32(254-263)Online publication date: 10-Nov-2014
https://dl.acm.org/doi/10.1007/978-3-319-15168-7_32
Islam MLiu CZhou R(2012)On modeling query refinement by capturing user intent through feedbackProceedings of the Twenty-Third Australasian Database Conference - Volume 12410.5555/2483739.2483743(11-20)Online publication date: 31-Jan-2012
https://dl.acm.org/doi/10.5555/2483739.2483743
dos Santos JHeuser CMoreira VWives L(2011)Automatic threshold estimation for data matching applicationsInformation Sciences: an International Journal10.1016/j.ins.2010.05.029181:13(2685-2699)Online publication date: 1-Jul-2011
https://dl.acm.org/doi/10.1016/j.ins.2010.05.029
Show More Cited By

Index Terms

Estimating recall and precision for vague queries in databases

Recommendations

Top-k dominating queries in uncertain databases
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate ...
Reliability of Answers to Queries in Relational Databases

The author studies the problem of determining the reliability of answers to queries in a relational database system, where the information in the database comes from various sources with varying degrees of reliability. An extended relational model is ...
Probabilistic top-k dominating queries in uncertain databases

Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

CAiSE'05: Proceedings of the 17th international conference on Advanced Information Systems Engineering

June 2005

583 pages

ISBN:3540260951

Editors:
Oscar Pastor
Department of Information Systems and Computation, Technical University of Valencia, Camino de Vera s/n, Valencia, Spain
,
João Falcão e Cunha
Department of Information Systems and Computation, Faculdade de Engenharia da Universidade do Porto, Camino de Vera s/n, Porto, Portugal

Sponsors

FCT: Foundation for Science and Technology
Universidad Politécnica de Valencia, Spain
University of Porto
FEUP: Faculdade de Engenharia da Univ. do Porto
ERCIM: European Research Consortium for Informatics & Mathematics

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 13 June 2005

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 16 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Bravo Balado Ade Boer VSchreiber G(2014)Linking Historical Ship Records to a Newspaper ArchiveSocial Informatics10.1007/978-3-319-15168-7_32(254-263)Online publication date: 10-Nov-2014
https://dl.acm.org/doi/10.1007/978-3-319-15168-7_32
Islam MLiu CZhou R(2012)On modeling query refinement by capturing user intent through feedbackProceedings of the Twenty-Third Australasian Database Conference - Volume 12410.5555/2483739.2483743(11-20)Online publication date: 31-Jan-2012
https://dl.acm.org/doi/10.5555/2483739.2483743
dos Santos JHeuser CMoreira VWives L(2011)Automatic threshold estimation for data matching applicationsInformation Sciences: an International Journal10.1016/j.ins.2010.05.029181:13(2685-2699)Online publication date: 1-Jul-2011
https://dl.acm.org/doi/10.1016/j.ins.2010.05.029
dos Santos JHeuser COrengo VWives Lda Silva Torres RRubira C(2008)Automatic threshold estimation for data matching applicationsProceedings of the 23rd Brazilian symposium on Databases10.5555/1498932.1498943(106-119)Online publication date: 13-Oct-2008
https://dl.acm.org/doi/10.5555/1498932.1498943
Borges EGalante RGonçalves Mda Silva Torres RRubira C(2008)Uma abordagem efetiva e eficiente para deduplicação de metadados bibliográficos de objetos digitaisProceedings of the 23rd Brazilian symposium on Databases10.5555/1498932.1498941(76-90)Online publication date: 13-Oct-2008
https://dl.acm.org/doi/10.5555/1498932.1498941
Dorneles CHeuser COrengo Vda Silva Ade Moura ELaender AFalcão AOlsen ØSilva MBaeza-Yates RMcGuinness DOlstad B(2007)A strategy for allowing meaningful and comparable scores in approximate matchingProceedings of the sixteenth ACM conference on Conference on information and knowledge management10.1145/1321440.1321484(303-312)Online publication date: 6-Nov-2007
https://dl.acm.org/doi/10.1145/1321440.1321484
de Brum Saccol DEdelweiss Nde Matos Galante RZaniolo CKing PSimske S(2007)XML version detectionProceedings of the 2007 ACM symposium on Document engineering10.1145/1284420.1284441(79-88)Online publication date: 28-Aug-2007
https://dl.acm.org/doi/10.1145/1284420.1284441
Kotsakis E(2006)XML fuzzy rankingProceedings of the 7th international conference on Flexible Query Answering Systems10.1007/11766254_14(159-169)Online publication date: 7-Jun-2006
https://dl.acm.org/doi/10.1007/11766254_14

View Options

View options

Media

Figures

Other

Tables

View Table of Contents