DOI: 10.1145/2428736.2428761
research-article

Style-based similarity search for office XML documents

Published: 03 December 2012

Abstract

Recent office document formats are XML archives: each document consists of multiple XML files. These files carry information about page structure and styles such as font, color, and position. However, existing text-based search engines do not take the structure and style of documents into account. By utilizing this information, we can achieve similarity search for office documents based on structures and styles. We propose SOS, a similarity search method based on the structures and styles of office documents. To compute a similarity value between two office documents, we have to compute similarity values between multiple pairs of XML files in the documents. We also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf-node clustering algorithm. In our experiments, we use docx, xlsx, and pptx files and evaluate SOS and LAX+ by precision and recall.
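The abstract does not spell out how SOS or LAX+ work internally, so the following is only a rough sketch of the overall shape of the problem: open two office documents as zip archives, compare the XML files they contain pair by pair, and combine the per-file scores into one document-level score. The per-file measure used here (Jaccard overlap of leaf-element paths and attributes) is a stand-in chosen for brevity and is not LAX+. Pairing files by identical part name and averaging the scores are likewise assumptions, and all function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def leaf_signatures(xml_bytes):
    """Collect (path, attributes) signatures of all leaf elements."""
    root = ET.fromstring(xml_bytes)
    signatures = set()

    def walk(node, path):
        current = path + "/" + node.tag
        children = list(node)
        if not children:
            # Attributes are where style values (font, color, ...) tend to live.
            signatures.add((current, tuple(sorted(node.attrib.items()))))
        for child in children:
            walk(child, current)

    walk(root, "")
    return signatures


def xml_similarity(a_bytes, b_bytes):
    """Stand-in per-file measure: Jaccard overlap of leaf signatures (not LAX+)."""
    a, b = leaf_signatures(a_bytes), leaf_signatures(b_bytes)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def xml_parts(archive):
    """Read every .xml member of a zip archive (path or file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return {n: zf.read(n) for n in zf.namelist() if n.endswith(".xml")}


def document_similarity(archive_a, archive_b):
    """Average per-file similarity; a part missing on one side scores 0."""
    parts_a, parts_b = xml_parts(archive_a), xml_parts(archive_b)
    names = set(parts_a) | set(parts_b)
    if not names:
        return 0.0
    total = 0.0
    for name in names:
        if name in parts_a and name in parts_b:
            total += xml_similarity(parts_a[name], parts_b[name])
    return total / len(names)


def make_archive(parts):
    """Build an in-memory zip from {name: xml_text}, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in parts.items():
            zf.writestr(name, text)
    buf.seek(0)
    return buf
```

Because docx, xlsx, and pptx files are ordinary zip archives, `document_similarity("a.docx", "b.docx")` should work on real files as well as on the in-memory archives built by `make_archive`. A real method would likely need smarter pairing of parts and a measure that tolerates small differences in attribute values, which exact-match Jaccard does not.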

Cited By

  • (2013) Similarity search for office XML documents based on style and structure data. International Journal of Web Information Systems 9:2 (100-116). DOI: 10.1108/IJWIS-03-2013-0005. Online publication date: 14-Jun-2013

    Published In

    IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
    December 2012
    432 pages
    ISBN:9781450313063
    DOI:10.1145/2428736

    Sponsors

    • @WAS: International Organization of Information Integration and Web-based Applications and Services

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. XML similarity
    2. office open XML
    3. style search

    Qualifiers

    • Research-article

    Conference

    IIWAS '12
    Sponsor:
    • @WAS
