Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1807085.1807090acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

Transducing Markov sequences

Published: 06 June 2010 Publication History

Abstract

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.

References

[1]
A. J. Bonner and G. Mecca. Sequences, datalog, and transducers. J. Comput. Syst. Sci., 57(3):234--259, 1998.
[2]
A. J. Bonner and G. Mecca. Querying sequence databases with transducers. Acta Inf., 36(7):511--544, 2000.
[3]
J. Boulos, N. N. Dalvi, B. Mandhani, S. Mathur, C. Ré, and D.Suciu. MYSTIQ: a system for finding more answers by using probabilities. In SIGMOD Conference, pages 891--893, 2005.
[4]
M.-Y. Chen, A. Kundu, and J. Zhou. Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):481--496, 1994.
[5]
R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD Conference, pages 551--562, 2003.
[6]
S. Cohen, B. Kimelfeld, and Y. Sagiv. Generating all maximal induced subgraphs for hereditary and connected-hereditary graph properties. J. Comput. Syst. Sci., 74(7):1147--1159, 2008.
[7]
S. Cohen, B. Kimelfeld, and Y. Sagiv. Running tree automata on probabilistic XML. In PODS, pages 227--236. ACM, 2009.
[8]
N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864--875, 2004.
[9]
N. N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293--302, 2007.
[10]
A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588--599, 2004.
[11]
Y. Diao, B. Li, A. Liu, L. Peng, C. Sutton, T. Tran, and M. Zink. Capturing data uncertainty in high-volume stream processing. In CIDR, 2009.
[12]
R. G. Downey and M. R. Fellows. Fixed-parameter tractability and completeness I: Basic results. SIAM Journal on Computing, 24(4):873--921, 1995.
[13]
R. Durbin, S. R. Eddy, A. Krogh, and G. J. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[14]
D. Eppstein. Finding the k shortest paths. SIAM J. Comput., 28(2):652--673, 1998.
[15]
B. Escoffier and V. T. Paschos. Differential approximation of min sat, max sat and related problems. In ICCSA (4), volume 3483 of Lecture Notes in Computer Science, pages 192--201. Springer, 2005.
[16]
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614--656, 2003.
[17]
C. H. Fosgate, H. Krim, W. W. Irving, W. C. Karl, and A. S. Willsky. Multiscale segmentation and anomaly enhancement of SAR imagery. IEEE Transactions on Image Processing, 6(1):7--20, 1997.
[18]
J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. Mcarthur, S. Minton, I. Xheneti, A. Toncheva, and A. Manfrediz. The expanding digital universe: A forecast of worldwide information growth through 2010, March 2007.
[19]
J. Håstad. Clique is hard to approximate within n1-ε. In FOCS, pages 627--636. IEEE Computer Society, 1996.
[20]
HMMER. Biosequence analysis using hidden Markov models. http://hmmer.janelia.org/.
[21]
HTK. The hidden Markov toolkit, October 2009. http://htk.eng.cam.ac.uk/.
[22]
J. Huang, L. Antova, C. Koch, and D. Olteanu. MayBMS: a probabilistic database management system. In SIGMOD Conference, pages 1071--1074, 2009.
[23]
G. Jirásková. State complexity of some operations on binary regular languages. Theor. Comput. Sci., 330(2):287--298, 2005.
[24]
D. Johnson, M. Yannakakis, and C. Papadimitriou. On generating all maximal independent sets. Information Processing Letters, 27:119--123, 1988.
[25]
B. Kanagal and A. Deshpande. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE, pages 1160--1169, 2008.
[26]
B. Kanagal and A. Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In ICDE, pages 1315--1318. IEEE, 2009.
[27]
B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD Conference, pages 455--468, 2009.
[28]
S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting and random generation of strings in regular languages. In SODA, pages 551--557, 1995.
[29]
A. Kempe. Finite state transducers approximating hidden Markov models. In ACL, pages 460--467, 1997.
[30]
B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query efficiency in probabilistic XML models. In SIGMOD Conference, pages 701--714, 2008.
[31]
B. Kimelfeld and C. Ré. Transducing Markov sequences (extended version). Accessible at the second author's home page (http://pages.cs.wisc.edu/ chrisre/), 2010.
[32]
B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173--182. ACM, 2006.
[33]
B. Kimelfeld and Y. Sagiv. Maximally joining probabilistic data. In PODS, pages 303--312. ACM, 2007.
[34]
B. Kimelfeld and Y. Sagiv. Efficiently enumerating results of keyword search over data graphs. Inf. Syst., 33(4-5):335--359, 2008.
[35]
C. Koch. Approximating predicates and expressive queries on probabilistic databases. In PODS, pages 99--108, 2008.
[36]
C. Koch. A compositional query algebra for second-order logic and uncertain databases. In ICDT, pages 127--140, 2009.
[37]
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282--289, 2001.
[38]
E. L. Lawler. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18:401--405, 1972.
[39]
J. Letchner, C. Ré, M. Balazinska, and M. Philipose. Access methods for Markovian streams. In ICDE, pages 246--257, 2009.
[40]
J. Letchner, C. Ré, M. Balazinska, and M. Philipose. Lahar demonstration: Warehousing Markovian streams. PVLDB, 2(2):1610--1613, 2009.
[41]
B. Ludascher, P. Mukhopadhyay, and Y. Papakonstantinou. A transducer-based XML query processor. In VLDB, pages 227--238. Morgan Kaufmann, 2002.
[42]
W. Martens and F. Neven. Typechecking top-down uniform unranked tree transducers. In ICDT, volume 2572 of Lecture Notes in Computer Science, pages 64--78. Springer, 2003.
[43]
K. G. Murty. An algorithm for ranking all the assignments in order of increasing costs. Operations Research, 16:682--687, 1968.
[44]
C. H. Papadimitriou and M. Yannakakis. On the complexity of database queries. Journal of Computer and System Sciences, 58(3):407--427, 1999.
[45]
J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4):777--788, 1983.
[46]
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257--286, 1989.
[47]
C. Ré, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, pages 715--728, 2008.
[48]
C. Ré and D. Suciu. Approximate lineage for probabilistic databases. PVLDB, 1(1):797--808, 2008.
[49]
A. D. Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, pages 1023--1032, 2008.
[50]
P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages 232--239. IEEE Computer Society, 1995.
[51]
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In HLT-NAACL, 2003.
[52]
F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS, pages 1249--1256, 2006.
[53]
S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. E. Hambrusch, J. Neville, and R. Cheng. Database support for probabilistic attributes and tuples. In ICDE, pages 1053--1061, 2008.
[54]
S. Toda and M. Ogiwara. Counting classes are at least as hard as the polynomial-time hierarchy. SIAM J. Comput., 21(2):316--328, 1992.
[55]
T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. J. Shenoy. Probabilistic inference over RFID streams in mobile environments. In ICDE, pages 1096--1107, 2009.
[56]
L. G. Valiant. The complexity of computing the permanent. Theor. Comput. Sci., 8:189--201, 1979.
[57]
M. Y. Vardi. The complexity of relational query languages (extended abstract). In STOC, pages 137--146. ACM, 1982.
[58]
J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262--276, 2005.
[59]
J. Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 17:712--716, 1971.
[60]
S. Zachos. Probabilistic quantifiers and games. Journal of Computer and System Sciences, 36(3):433--451, 1988.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2010
350 pages
ISBN:9781450300339
DOI:10.1145/1807085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2010

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Markov sequences
  2. enumeration
  3. hidden Markov models
  4. probabilistic databases
  5. ranked query evaluation
  6. transducers

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '10
Sponsor:
SIGMOD/PODS '10: International Conference on Management of Data
June 6 - 11, 2010
Indiana, Indianapolis, USA

Acceptance Rates

PODS '10 Paper Acceptance Rate 27 of 113 submissions, 24%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2014)A Complex Event Detection Method for Multi-probability RFID Event StreamJournal of Software10.4304/jsw.9.4.834-8409:4Online publication date: 1-Apr-2014
  • (2014)Transducing Markov sequencesJournal of the ACM10.1145/263006561:5(1-48)Online publication date: 8-Sep-2014
  • (2014)Database principles in information extractionProceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems10.1145/2594538.2594563(156-163)Online publication date: 18-Jun-2014
  • (2013)Report on the first workshop on innovative querying of streamsACM SIGMOD Record10.1145/2503792.250380742:2(59-63)Online publication date: 16-Jul-2013
  • (2011)Optimizing and parallelizing ranked enumerationProceedings of the VLDB Endowment10.14778/3402707.34027394:11(1028-1039)Online publication date: 1-Aug-2011
  • (2011)Probabilistic management of OCR data using an RDBMSProceedings of the VLDB Endowment10.14778/2095686.20956915:4(322-333)Online publication date: 1-Dec-2011
  • (2011)Lineage for Markovian stream event queriesProceedings of the 10th ACM International Workshop on Data Engineering for Wireless and Mobile Access10.1145/1999309.1999317(26-33)Online publication date: 12-Jun-2011
  • (2011)A unified approach to ranking in probabilistic databasesThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-011-0220-320:2(249-275)Online publication date: 1-Apr-2011

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media