Abstract
The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q - 1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
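As an illustration of the two ingredients named in the abstract, the following minimal Python sketch computes the autocorrelation of a word, enumerates the distinct autocorrelations of length q by brute force, and evaluates an independence-based approximation of the expected NMW and NCW. The abstract does not state the formulas, so several things here are assumptions: characters are taken to be uniform and i.i.d. over an alphabet of size sigma, the approximations are the classical occupancy (urn-model) expressions, and all function names are hypothetical. The brute-force enumeration is not the efficient algorithm the paper presents; it only serves to make the definitions concrete for small q.

```python
from itertools import product


def autocorrelation(word):
    """Return a bit string whose bit i is '1' iff the word, shifted right
    by i positions, agrees with itself on the overlapping part."""
    q = len(word)
    return "".join("1" if word[i:] == word[: q - i] else "0" for i in range(q))


def distinct_autocorrelations(q, alphabet="ab"):
    """Brute force over all words of length q (exponential in q)."""
    return {autocorrelation("".join(w)) for w in product(alphabet, repeat=q)}


def approx_expected_missing(sigma, q, n):
    """Assumed approximation: treat the n - q + 1 windows of the text as
    independent uniform draws from the sigma**q possible words."""
    windows = n - q + 1
    return sigma**q * (1.0 - sigma**-q) ** windows


def approx_expected_common(sigma, q, n, m):
    """Assumed approximation for two independent texts of lengths n and m:
    a word is common if it occurs in both texts."""
    p_n = 1.0 - (1.0 - sigma**-q) ** (n - q + 1)
    p_m = 1.0 - (1.0 - sigma**-q) ** (m - q + 1)
    return sigma**q * p_n * p_m


def exact_expected_missing(alphabet, q, n):
    """Exact expectation by enumerating every text of length n.
    Feasible only for tiny parameters; used here as a reference value."""
    total_words = len(alphabet) ** q
    missing = 0
    for text in product(alphabet, repeat=n):
        seen = {text[i:i + q] for i in range(n - q + 1)}
        missing += total_words - len(seen)
    return missing / len(alphabet) ** n


if __name__ == "__main__":
    print(autocorrelation("abab"))
    print(sorted(distinct_autocorrelations(5)))
    print(exact_expected_missing("ab", 3, 8), approx_expected_missing(2, 3, 8))
```

Comparing exact_expected_missing with approx_expected_missing for a few small values of q and n gives a feel for how close the independence assumption comes, although brute force quickly becomes infeasible as n grows.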
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rahmann, S., Rivals, E. (2000). Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_31
DOI: https://doi.org/10.1007/3-540-45123-4_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive