research-article

Transducing Markov sequences

Authors:

Benny Kimelfeld,

Christopher RéAuthors Info & Claims

PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 15 - 26

https://doi.org/10.1145/1807085.1807090

Published: 06 June 2010 Publication History

Abstract

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.

References

[1]

A. J. Bonner and G. Mecca. Sequences, datalog, and transducers. J. Comput. Syst. Sci., 57(3):234--259, 1998.

Digital Library

[2]

A. J. Bonner and G. Mecca. Querying sequence databases with transducers. Acta Inf., 36(7):511--544, 2000.

Digital Library

[3]

J. Boulos, N. N. Dalvi, B. Mandhani, S. Mathur, C. Ré, and D.Suciu. MYSTIQ: a system for finding more answers by using probabilities. In SIGMOD Conference, pages 891--893, 2005.

Digital Library

[4]

M.-Y. Chen, A. Kundu, and J. Zhou. Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):481--496, 1994.

Digital Library

[5]

R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD Conference, pages 551--562, 2003.

Digital Library

[6]

S. Cohen, B. Kimelfeld, and Y. Sagiv. Generating all maximal induced subgraphs for hereditary and connected-hereditary graph properties. J. Comput. Syst. Sci., 74(7):1147--1159, 2008.

Digital Library

[7]

S. Cohen, B. Kimelfeld, and Y. Sagiv. Running tree automata on probabilistic XML. In PODS, pages 227--236. ACM, 2009.

Digital Library

[8]

N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864--875, 2004.

Digital Library

[9]

N. N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293--302, 2007.

Digital Library

[10]

A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588--599, 2004.

Digital Library

[11]

Y. Diao, B. Li, A. Liu, L. Peng, C. Sutton, T. Tran, and M. Zink. Capturing data uncertainty in high-volume stream processing. In CIDR, 2009.

[12]

R. G. Downey and M. R. Fellows. Fixed-parameter tractability and completeness I: Basic results. SIAM Journal on Computing, 24(4):873--921, 1995.

Digital Library

[13]

R. Durbin, S. R. Eddy, A. Krogh, and G. J. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[14]

D. Eppstein. Finding the k shortest paths. SIAM J. Comput., 28(2):652--673, 1998.

Digital Library

[15]

B. Escoffier and V. T. Paschos. Differential approximation of min sat, max sat and related problems. In ICCSA (4), volume 3483 of Lecture Notes in Computer Science, pages 192--201. Springer, 2005.

Digital Library

[16]

R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003.

Digital Library

[17]

C. H. Fosgate, H. Krim, W. W. Irving, W. C. Karl, and A. S. Willsky. Multiscale segmentation and anomaly enhancement of SAR imagery. IEEE Transactions on Image Processing, 6(1):7--20, 1997.

Digital Library

[18]

J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. Mcarthur, S. Minton, I. Xheneti, A. Toncheva, and A. Manfrediz. The expanding digital universe: A forecast of worldwide information growth through 2010, March 2007.

[19]

J. Håstad. Clique is hard to approximate within n^1-ε. In FOCS, pages 627--636. IEEE Computer Society, 1996.

Digital Library

[20]

HMMER. Biosequence analysis using hidden Markov models. http://hmmer.janelia.org/.

[21]

HTK. The hidden Markov toolkit, October 2009. http://htk.eng.cam.ac.uk/.

[22]

J. Huang, L. Antova, C. Koch, and D. Olteanu. MayBMS: a probabilistic database management system. In SIGMOD Conference, pages 1071--1074, 2009.

Digital Library

[23]

G. Jirásková. State complexity of some operations on binary regular languages. Theor. Comput. Sci., 330(2):287--298, 2005.

Digital Library

[24]

D. Johnson, M. Yannakakis, and C. Papadimitriou. On generating all maximal independent sets. Information Processing Letters, 27:119--123, 1988.

Digital Library

[25]

B. Kanagal and A. Deshpande. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE, pages 1160--1169, 2008.

Digital Library

[26]

B. Kanagal and A. Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In ICDE, pages 1315--1318. IEEE, 2009.

Digital Library

[27]

B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD Conference, pages 455--468, 2009.

Digital Library

[28]

S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting and random generation of strings in regular languages. In SODA, pages 551--557, 1995.

Digital Library

[29]

A. Kempe. Finite state transducers approximating hidden Markov models. In ACL, pages 460--467, 1997.

Digital Library

[30]

B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query efficiency in probabilistic XML models. In SIGMOD Conference, pages 701--714, 2008.

Digital Library

[31]

B. Kimelfeld and C. Ré. Transducing Markov sequences (extended version). Accessible at the second author's home page (http://pages.cs.wisc.edu/ chrisre/), 2010.

[32]

B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173--182. ACM, 2006.

Digital Library

[33]

B. Kimelfeld and Y. Sagiv. Maximally joining probabilistic data. In PODS, pages 303--312. ACM, 2007.

Digital Library

[34]

B. Kimelfeld and Y. Sagiv. Efficiently enumerating results of keyword search over data graphs. Inf. Syst., 33(4-5):335--359, 2008.

Digital Library

[35]

C. Koch. Approximating predicates and expressive queries on probabilistic databases. In PODS, pages 99--108, 2008.

Digital Library

[36]

C. Koch. A compositional query algebra for second-order logic and uncertain databases. In ICDT, pages 127--140, 2009.

Digital Library

[37]

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282--289, 2001.

Digital Library

[38]

E. L. Lawler. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18:401--405, 1972.

Digital Library

[39]

J. Letchner, C. Ré, M. Balazinska, and M. Philipose. Access methods for Markovian streams. In ICDE, pages 246--257, 2009.

Digital Library

[40]

J. Letchner, C. Ré, M. Balazinska, and M. Philipose. Lahar demonstration: Warehousing Markovian streams. PVLDB, 2(2):1610--1613, 2009.

Digital Library

[41]

B. Ludascher, P. Mukhopadhyay, and Y. Papakonstantinou. A transducer-based XML query processor. In VLDB, pages 227--238. Morgan Kaufmann, 2002.

Digital Library

[42]

W. Martens and F. Neven. Typechecking top-down uniform unranked tree transducers. In ICDT, volume 2572 of Lecture Notes in Computer Science, pages 64--78. Springer, 2003.

Digital Library

[43]

K. G. Murty. An algorithm for ranking all the assignments in order of increasing costs. Operations Research, 16:682--687, 1968.

Digital Library

[44]

C. H. Papadimitriou and M. Yannakakis. On the complexity of database queries. Journal of Computer and System Sciences, 58(3):407--427, 1999.

Digital Library

[45]

J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4):777--788, 1983.

Digital Library

[46]

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257--286, 1989.

[47]

C. Ré, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, pages 715--728, 2008.

Digital Library

[48]

C. Ré and D. Suciu. Approximate lineage for probabilistic databases. PVLDB, 1(1):797--808, 2008.

Digital Library

[49]

A. D. Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, pages 1023--1032, 2008.

Digital Library

[50]

P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232--239. IEEE Computer Society, 1995.

Digital Library

[51]

F. Sha and F. Pereira. Shallow parsing with conditional random fields. In HLT-NAACL, 2003.

Digital Library

[52]

F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS, pages 1249--1256, 2006.

[53]

S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. E. Hambrusch, J. Neville, and R. Cheng. Database support for probabilistic attributes and tuples. In ICDE, pages 1053--1061, 2008.

Digital Library

[54]

S. Toda and M. Ogiwara. Counting classes are at least as hard as the polynomial-time hierarchy. SIAM J. Comput., 21(2):316--328, 1992.

Digital Library

[55]

T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. J. Shenoy. Probabilistic inference over RFID streams in mobile environments. In ICDE, pages 1096--1107, 2009.

Digital Library

[56]

L. G. Valiant. The complexity of computing the permanent. Theor. Comput. Sci., 8:189--201, 1979.

[57]

M. Y. Vardi. The complexity of relational query languages (extended abstract). In STOC, pages 137--146. ACM, 1982.

Digital Library

[58]

J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262--276, 2005.

[59]

J. Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 17:712--716, 1971.

Digital Library

[60]

S. Zachos. Probabilistic quantifiers and games. Journal of Computer and System Sciences, 36(3):433--451, 1988.

Digital Library

Cited By

Wang JCheng LLiu J(2014)A Complex Event Detection Method for Multi-probability RFID Event StreamJournal of Software10.4304/jsw.9.4.834-8409:4Online publication date: 1-Apr-2014
https://doi.org/10.4304/jsw.9.4.834-840
Kimelfeld BRé C(2014)Transducing Markov sequencesJournal of the ACM10.1145/263006561:5(1-48)Online publication date: 8-Sep-2014
https://dl.acm.org/doi/10.1145/2630065
Kimelfeld BHull RGrohe M(2014)Database principles in information extractionProceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems10.1145/2594538.2594563(156-163)Online publication date: 18-Jun-2014
https://dl.acm.org/doi/10.1145/2594538.2594563
Show More Cited By

Index Terms

Transducing Markov sequences
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Transducing Markov sequences

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and ...
Semi-hidden Markov models for generation and analysis of sequences

In this work a new kind of stochastic model is presented, the semi-hidden Markov model (SHMM). The proposed model is related to the hidden Markov model (HMM), and it is called semi-hidden because generated sequences need less information than HMM ...
Coding with partially hidden Markov models
DCC '95: Proceedings of the Conference on Data Compression

Partially hidden Markov models (PHMM) are introduced. They are a variation of the hidden Markov models (HMM) combining the power of explicit conditioning on past observations and the power of using hidden states. (P)HMM may be combined with arithmetic ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

June 2010

350 pages

ISBN:9781450300339

DOI:10.1145/1807085

General Chair:
Jan Paredaens
University of Antwerp, Belgium
,
Program Chair:
Dirk Van Gucht
Indiana University, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2010

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '10

Sponsor:

SIGMOD

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 11, 2010

Indiana, Indianapolis, USA

Acceptance Rates

PODS '10 Paper Acceptance Rate 27 of 113 submissions, 24%;

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
259
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 19 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Wang JCheng LLiu J(2014)A Complex Event Detection Method for Multi-probability RFID Event StreamJournal of Software10.4304/jsw.9.4.834-8409:4Online publication date: 1-Apr-2014
https://doi.org/10.4304/jsw.9.4.834-840
Kimelfeld BRé C(2014)Transducing Markov sequencesJournal of the ACM10.1145/263006561:5(1-48)Online publication date: 8-Sep-2014
https://dl.acm.org/doi/10.1145/2630065
Kimelfeld BHull RGrohe M(2014)Database principles in information extractionProceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems10.1145/2594538.2594563(156-163)Online publication date: 18-Jun-2014
https://dl.acm.org/doi/10.1145/2594538.2594563
Benedikt MOlteanu D(2013)Report on the first workshop on innovative querying of streamsACM SIGMOD Record10.1145/2503792.250380742:2(59-63)Online publication date: 16-Jul-2013
https://dl.acm.org/doi/10.1145/2503792.2503807
Golenberg KKimelfeld BSagiv Y(2011)Optimizing and parallelizing ranked enumerationProceedings of the VLDB Endowment10.14778/3402707.34027394:11(1028-1039)Online publication date: 1-Aug-2011
https://dl.acm.org/doi/10.14778/3402707.3402739
Kumar ARé C(2011)Probabilistic management of OCR data using an RDBMSProceedings of the VLDB Endowment10.14778/2095686.20956915:4(322-333)Online publication date: 1-Dec-2011
https://dl.acm.org/doi/10.14778/2095686.2095691
Letchner JBalazinska M(2011)Lineage for Markovian stream event queriesProceedings of the 10th ACM International Workshop on Data Engineering for Wireless and Mobile Access10.1145/1999309.1999317(26-33)Online publication date: 12-Jun-2011
https://dl.acm.org/doi/10.1145/1999309.1999317
Li JSaha BDeshpande A(2011)A unified approach to ranking in probabilistic databasesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-011-0220-320:2(249-275)Online publication date: 1-Apr-2011
https://dl.acm.org/doi/10.1007/s00778-011-0220-3

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents